Automatic summarization of cooking videos using transfer learning and transformer-based models

Abstract The proliferation of cooking videos on the internet creates a need to convert lengthy video content into concise text recipes. Many online platforms now host large numbers of cooking videos, and viewers find it hard to extract comprehensive recipes...

Bibliographic Details
Main Authors:	P. M. Alen Sadique, R. V. Aswiga
Format:	Article
Language:	English
Published:	Springer 2025-01-01
Series:	Discover Artificial Intelligence
Subjects:	Automated summarization, Transfer learning, Computer vision, Natural language processing, Speech recognition, Convolutional neural network
Online Access:	https://doi.org/10.1007/s44163-025-00230-y

author	P. M. Alen Sadique; R. V. Aswiga
collection	DOAJ
description	Abstract The proliferation of cooking videos on the internet creates a need to convert lengthy video content into concise text recipes. Many online platforms now host large numbers of cooking videos, and viewers find it hard to extract comprehensive recipes from lengthy visual content. Effective summarization is needed to translate the abundance of culinary knowledge in videos into text recipes that are easy to read and follow. This makes cooking easier for people looking for precise, step-by-step instructions. Such a system serves a broad spectrum of learners while also improving accessibility and ease of use. Given the growing need for easy-to-follow recipes derived from cooking videos, researchers are investigating automated summarization using advanced techniques. Our work presents one such approach, which combines simple image-based models, audio processing, and GPT-based models into a system that turns long culinary videos into in-depth recipe texts. The workflow has three stages. First, frame summary generation uses two convolutional neural networks and a GPT-based model: a pre-trained CNN, Inception-V3, is fine-tuned on a food image dataset for dish recognition, and a custom CNN is trained on ingredient images for ingredient recognition. A GPT-based model then combines the outputs of the two CNNs into a frame summary in the desired format. Second, audio summary generation applies speech-to-text in Python, and a GPT-based model summarizes the resulting transcript in the desired format. Finally, another GPT-based model combines the frame summary and the audio summary into the final enhanced summary. By avoiding the complications of traditional and more sophisticated methodologies, this research contributes a straightforward but efficient cooking video summarization system. The results are on par with existing work in the field, demonstrating comparable performance in converting cooking videos into detailed recipe texts.
format	Article
id	doaj-art-fd8e75f9522c42a9890c0e35b880dc5e
institution	Kabale University
issn	2731-0809
language	English
publishDate	2025-01-01
publisher	Springer
record_format	Article
series	Discover Artificial Intelligence
spelling	doaj-art-fd8e75f9522c42a9890c0e35b880dc5e; 2025-01-26T12:43:02Z; eng; Springer; Discover Artificial Intelligence; 2731-0809; 2025-01-01; 51120; 10.1007/s44163-025-00230-y; Automatic summarization of cooking videos using transfer learning and transformer-based models; P. M. Alen Sadique 0; R. V. Aswiga 1; School of Computer Science and Engineering, Vellore Institute of Technology; School of Computer Science and Engineering, Vellore Institute of Technology; https://doi.org/10.1007/s44163-025-00230-y; Automated summarization; Transfer learning; Computer vision; Natural language processing; Speech recognition; Convolutional neural network
title	Automatic summarization of cooking videos using transfer learning and transformer-based models
topic	Automated summarization; Transfer learning; Computer vision; Natural language processing; Speech recognition; Convolutional neural network
url	https://doi.org/10.1007/s44163-025-00230-y
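
The three-stage workflow described in the abstract (frame summary, audio summary, final merge) can be pictured as a small data-flow sketch. This is a minimal illustration only, not the authors' implementation: the class name `RecipePipeline`, its method names, and the prompt wording are all hypothetical, and the classifiers, speech-to-text step, and text model are passed in as plain callables so that no particular library API is assumed. The abstract mentions a separate GPT-based model per stage; the sketch reuses one `summarize` callable for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type aliases, used for illustration only.
Frame = bytes                        # stand-in for one decoded video frame
Classifier = Callable[[Frame], str]  # maps a frame to a predicted label
TextModel = Callable[[str], str]     # maps a prompt to generated text


@dataclass
class RecipePipeline:
    """Hypothetical wiring of the stages named in the abstract."""
    classify_dish: Classifier            # e.g. a fine-tuned Inception-V3
    classify_ingredient: Classifier      # e.g. a custom ingredient CNN
    transcribe: Callable[[bytes], str]   # any speech-to-text backend
    summarize: TextModel                 # any GPT-style text model

    def frame_summary(self, frames: Sequence[Frame]) -> str:
        # Collect the distinct labels predicted across all sampled frames.
        dishes = sorted({self.classify_dish(f) for f in frames})
        ingredients = sorted({self.classify_ingredient(f) for f in frames})
        prompt = (
            "Dish candidates: " + ", ".join(dishes) + "\n"
            "Ingredients seen: " + ", ".join(ingredients) + "\n"
            "Write a short recipe outline."
        )
        return self.summarize(prompt)

    def audio_summary(self, audio: bytes) -> str:
        # Speech-to-text first, then summarize the transcript.
        transcript = self.transcribe(audio)
        return self.summarize(
            "Summarize this narration as recipe steps:\n" + transcript
        )

    def final_summary(self, frames: Sequence[Frame], audio: bytes) -> str:
        # Merge the visual and audio summaries into one recipe text.
        visual = self.frame_summary(frames)
        spoken = self.audio_summary(audio)
        prompt = (
            "Visual summary:\n" + visual + "\n\n"
            "Audio summary:\n" + spoken + "\n\n"
            "Merge these into one recipe."
        )
        return self.summarize(prompt)
```

Because every stage is injected as a callable, the flow can be exercised with simple stand-in functions before any real CNN, speech recognizer, or language model is attached.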


Similar Items