Automatic summarization of cooking videos using transfer learning and transformer-based models
Abstract: The proliferation of cooking videos online calls for converting lengthy video content into concise text recipes. Many online platforms now host large numbers of cooking videos, and viewers find it hard to extract comprehensive recipes...
Main Authors: | P. M. Alen Sadique, R. V. Aswiga |
---|---|
Format: | Article |
Language: | English |
Published: | Springer 2025-01-01 |
Series: | Discover Artificial Intelligence |
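Subjects: | Automated summarization; Transfer learning; Computer vision; Natural language processing; Speech recognition; Convolutional neural network |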
Online Access: | https://doi.org/10.1007/s44163-025-00230-y |
_version_ | 1832585609702539264 |
---|---|
author | P. M. Alen Sadique; R. V. Aswiga |
author_facet | P. M. Alen Sadique; R. V. Aswiga |
author_sort | P. M. Alen Sadique |
collection | DOAJ |
description | Abstract: The proliferation of cooking videos online calls for converting lengthy video content into concise text recipes. Many online platforms now host large numbers of cooking videos, and viewers find it hard to extract complete recipes from lengthy visual content. Effective summarization is needed to translate the culinary knowledge in these videos into text recipes that are easy to read and follow, which simplifies cooking for people looking for precise step-by-step instructions. Such a system serves a broad spectrum of learners while improving accessibility and ease of use. Given the growing need for easy-to-follow recipes derived from cooking videos, researchers are investigating automated summarization with advanced techniques. Our work presents one such approach, combining simple image-based models, audio processing, and GPT-based models into a system that turns long culinary videos into in-depth recipe texts. The workflow has three stages. First, frame summary generation uses two convolutional neural networks (CNNs) and a GPT-based model: a pre-trained Inception-V3 CNN is fine-tuned on a food image dataset for dish recognition, and a custom CNN is trained on ingredient images for ingredient recognition. A GPT-based model then combines the outputs of the two CNNs into a frame summary in the desired format. Second, audio summary generation applies speech-to-text in Python, and a GPT-based model summarizes the resulting transcript in the desired format. Finally, another GPT-based model combines the frame summary and the audio summary to produce the final, enhanced summary. By avoiding the complications of traditional and more sophisticated methodologies, this research contributes a straightforward but efficient cooking video summarization system. The results are on par with existing work in the field, demonstrating comparable performance and efficacy in converting cooking videos into detailed recipe texts. |
format | Article |
id | doaj-art-fd8e75f9522c42a9890c0e35b880dc5e |
institution | Kabale University |
issn | 2731-0809 |
language | English |
publishDate | 2025-01-01 |
publisher | Springer |
record_format | Article |
series | Discover Artificial Intelligence |
spelling | doaj-art-fd8e75f9522c42a9890c0e35b880dc5e; 2025-01-26T12:43:02Z; eng; Springer; Discover Artificial Intelligence; 2731-0809; 2025-01-01; 51120; 10.1007/s44163-025-00230-y; Automatic summarization of cooking videos using transfer learning and transformer-based models; P. M. Alen Sadique (School of Computer Science and Engineering, Vellore Institute of Technology); R. V. Aswiga (School of Computer Science and Engineering, Vellore Institute of Technology); https://doi.org/10.1007/s44163-025-00230-y; Automated summarization; Transfer learning; Computer vision; Natural language processing; Speech recognition; Convolutional neural network |
spellingShingle | P. M. Alen Sadique R. V. Aswiga Automatic summarization of cooking videos using transfer learning and transformer-based models Discover Artificial Intelligence Automated summarization Transfer learning Computer vision Natural language processing Speech recognition Convolutional neural network |
title | Automatic summarization of cooking videos using transfer learning and transformer-based models |
title_full | Automatic summarization of cooking videos using transfer learning and transformer-based models |
title_fullStr | Automatic summarization of cooking videos using transfer learning and transformer-based models |
title_full_unstemmed | Automatic summarization of cooking videos using transfer learning and transformer-based models |
title_short | Automatic summarization of cooking videos using transfer learning and transformer-based models |
title_sort | automatic summarization of cooking videos using transfer learning and transformer based models |
topic | Automated summarization; Transfer learning; Computer vision; Natural language processing; Speech recognition; Convolutional neural network |
url | https://doi.org/10.1007/s44163-025-00230-y |
work_keys_str_mv | AT pmalensadique automaticsummarizationofcookingvideosusingtransferlearningandtransformerbasedmodels AT rvaswiga automaticsummarizationofcookingvideosusingtransferlearningandtransformerbasedmodels |
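
The three-stage workflow described in the abstract (frame summary, audio summary, final merged summary) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the class name `RecipeSummarizer`, the callable-based interfaces, the prompt wording, and the majority vote over per-frame dish predictions are all assumptions made for the example. The abstract does not say how per-frame predictions are aggregated or how the GPT-based models are prompted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interfaces: each model is just a callable, so real models
# (a fine-tuned Inception-V3, a custom CNN, a speech-to-text engine,
# a GPT-based model) could be wrapped and plugged in.
Classifier = Callable[[bytes], str]   # encoded frame -> predicted label
Transcriber = Callable[[bytes], str]  # audio track -> transcript
TextModel = Callable[[str], str]      # prompt -> generated text


@dataclass
class RecipeSummarizer:
    dish_model: Classifier        # stands in for the dish-recognition CNN
    ingredient_model: Classifier  # stands in for the ingredient-recognition CNN
    transcribe: Transcriber       # stands in for the speech-to-text step
    llm: TextModel                # stands in for the GPT-based model(s)

    def frame_summary(self, frames: Sequence[bytes]) -> str:
        """Stage 1: combine the two classifiers' outputs into a frame summary."""
        if not frames:
            raise ValueError("at least one frame is required")
        # Assumption: the dish is chosen by majority vote across frames.
        dish = Counter(self.dish_model(f) for f in frames).most_common(1)[0][0]
        # Assumption: ingredients are the distinct labels seen in any frame.
        ingredients = sorted({self.ingredient_model(f) for f in frames})
        prompt = (
            "Write a frame summary.\n"
            f"Dish: {dish}\n"
            f"Ingredients: {', '.join(ingredients)}"
        )
        return self.llm(prompt)

    def audio_summary(self, audio: bytes) -> str:
        """Stage 2: transcribe the narration, then summarize the transcript."""
        transcript = self.transcribe(audio)
        return self.llm("Summarize this narration as recipe steps.\n" + transcript)

    def summarize(self, frames: Sequence[bytes], audio: bytes) -> str:
        """Stage 3: merge the frame and audio summaries into one recipe text."""
        visual = self.frame_summary(frames)
        spoken = self.audio_summary(audio)
        return self.llm(
            "Merge into one recipe.\nFrames: " + visual + "\nAudio: " + spoken
        )
```

Passing the models in as plain callables keeps the sketch independent of any particular vision, speech, or language-model library, so each stage can be exercised with stub functions before real models are attached.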