Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis

Bibliographic Details
Main Authors:	Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler, Pål Halvorsen
Format:	Article
Language:	English
Published:	MDPI AG 2025-03-01
Series:	Big Data and Cognitive Computing
Subjects:	talking portrait synthesis; interactive avatar; Whisper; neural radiance fields (NeRFs); child protective services (CPS)
Online Access:	https://www.mdpi.com/2504-2289/9/3/59

collection	DOAJ
description	This paper explores advancements in real-time talking-head generation, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI’s Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. Although interviewer training systems are considered a potential application, the primary contribution of this work is the improvement of the technical foundations necessary for creating responsive AI avatars. These advancements enable more immersive interactions and expand the scope of AI-driven applications, including educational tools and simulated training environments.
id	doaj-art-bc1d06f057ff423ba1d67fdff3ec3687
institution	DOAJ
issn	2504-2289
doi	10.3390/bdcc9030059
volume	9
issue	3
article_number	59
affiliations	Pegah Salehi (SimulaMet, 0167 Oslo, Norway); Sajad Amouei Sheshkal (SimulaMet, 0167 Oslo, Norway); Vajira Thambawita (SimulaMet, 0167 Oslo, Norway); Sushant Gautam (SimulaMet, 0167 Oslo, Norway); Saeed S. Sabet (Forzasys AS, 0840 Oslo, Norway); Dag Johansen (Department of Computer Science, UiT The Arctic University of Norway, 9019 Tromsø, Norway); Michael A. Riegler (Simula Research Laboratory, 0164 Oslo, Norway); Pål Halvorsen (SimulaMet, 0167 Oslo, Norway)

Similar Items