Automatic summarization of cooking videos using transfer learning and transformer-based models

Abstract The proliferation of cooking videos on the internet calls for converting lengthy video content into concise text recipes. Many online platforms now host large numbers of cooking videos, and viewers find it difficult to extract comprehensive recipes...

Bibliographic Details
Main Authors: P. M. Alen Sadique, R. V. Aswiga
Format: Article
Language: English
Published: Springer 2025-01-01
Series: Discover Artificial Intelligence
Subjects:
Online Access: https://doi.org/10.1007/s44163-025-00230-y
author P. M. Alen Sadique
R. V. Aswiga
collection DOAJ
description Abstract The proliferation of cooking videos on the internet calls for converting lengthy video content into concise text recipes. Many online platforms now host large numbers of cooking videos, and viewers find it difficult to extract complete recipes from lengthy visual content. Effective summarization is needed to translate the culinary knowledge in these videos into text recipes that are easy to read and follow, which simplifies cooking for people looking for precise step-by-step instructions. Such a system serves a broad spectrum of learners while improving accessibility and ease of use. Because demand for easy-to-follow recipes derived from cooking videos is growing, researchers are investigating automated summarization with advanced techniques. Our work presents one such approach, combining simple image-based models, audio processing, and GPT-based models into a system that turns long culinary videos into in-depth recipe texts. The workflow has three stages. First, frame summary generation uses two convolutional neural networks (CNNs) and a GPT-based model: a pre-trained Inception-V3 CNN is fine-tuned on a food image dataset for dish recognition, and a custom CNN is trained on ingredient images for ingredient recognition. A GPT-based model then combines the outputs of the two CNNs into a frame summary in the desired format. Second, audio summary generation applies speech-to-text in Python, and a GPT-based model summarizes the resulting transcript in the desired format. Finally, another GPT-based model refines the result by combining the frame summary and the audio summary into the final enhanced summary. By avoiding the complications of traditional, more sophisticated methodologies, this research contributes a straightforward but efficient cooking video summarization system. The results are on par with existing work in the field, demonstrating comparable performance and efficacy in converting cooking videos into detailed recipe texts.
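The three-stage workflow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the class name `RecipePipeline`, its method names, the prompt wording, and the use of a single `summarize` callable for all three GPT-based steps are assumptions made for this example. The recognizers, the speech-to-text step, and the text model are passed in as plain functions so that no particular library API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = bytes  # placeholder for one decoded video frame (assumption)


@dataclass
class RecipePipeline:
    """Hypothetical wiring of the stages named in the abstract."""

    recognize_dish: Callable[[Frame], str]               # e.g. a fine-tuned Inception-V3
    recognize_ingredients: Callable[[Frame], List[str]]  # e.g. a custom ingredient CNN
    transcribe: Callable[[bytes], str]                   # speech-to-text step
    summarize: Callable[[str], str]                      # GPT-based text model

    def frame_summary(self, frames: Sequence[Frame]) -> str:
        """Collect distinct dish and ingredient labels, then summarize them."""
        dishes: List[str] = []
        ingredients: List[str] = []
        for frame in frames:
            dish = self.recognize_dish(frame)
            if dish not in dishes:
                dishes.append(dish)
            for item in self.recognize_ingredients(frame):
                if item not in ingredients:
                    ingredients.append(item)
        prompt = (
            "Write a frame summary.\n"
            f"Dishes: {', '.join(dishes)}\n"
            f"Ingredients: {', '.join(ingredients)}"
        )
        return self.summarize(prompt)

    def audio_summary(self, audio: bytes) -> str:
        """Transcribe the audio track, then summarize the transcript."""
        transcript = self.transcribe(audio)
        return self.summarize("Summarize this narration:\n" + transcript)

    def final_summary(self, frames: Sequence[Frame], audio: bytes) -> str:
        """Merge the frame and audio summaries into one recipe text."""
        combined = (
            "Merge into one recipe.\n"
            f"Frame summary: {self.frame_summary(frames)}\n"
            f"Audio summary: {self.audio_summary(audio)}"
        )
        return self.summarize(combined)
```

In practice the injected functions would wrap whichever image classifiers, speech-to-text package, and GPT-based model are available; keeping them as parameters lets each stage be replaced or tested with stubs independently.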
format Article
id doaj-art-fd8e75f9522c42a9890c0e35b880dc5e
institution Kabale University
issn 2731-0809
language English
publishDate 2025-01-01
publisher Springer
record_format Article
series Discover Artificial Intelligence
spelling doaj-art-fd8e75f9522c42a9890c0e35b880dc5e; 2025-01-26T12:43:02Z; eng; Springer; Discover Artificial Intelligence; 2731-0809; 2025-01-01; 51120; 10.1007/s44163-025-00230-y; Automatic summarization of cooking videos using transfer learning and transformer-based models; P. M. Alen Sadique; R. V. Aswiga; School of Computer Science and Engineering, Vellore Institute of Technology (listed for both authors); https://doi.org/10.1007/s44163-025-00230-y; Automated summarization; Transfer learning; Computer vision; Natural language processing; Speech recognition; Convolutional neural network
title Automatic summarization of cooking videos using transfer learning and transformer-based models
topic Automated summarization
Transfer learning
Computer vision
Natural language processing
Speech recognition
Convolutional neural network
url https://doi.org/10.1007/s44163-025-00230-y