Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis

Bibliographic Details
Main Authors: Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler, Pål Halvorsen
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Big Data and Cognitive Computing
Online Access: https://www.mdpi.com/2504-2289/9/3/59
Description: This paper explores advancements in real-time talking-head generation, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI’s Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. Although interviewer training systems are considered a potential application, the primary contribution of this work is the improvement of the technical foundations necessary for creating responsive AI avatars. These advancements enable more immersive interactions and expand the scope of AI-driven applications, including educational tools and simulated training environments.
ISSN: 2504-2289
Citation: Big Data and Cognitive Computing, vol. 9, no. 3, article 59
DOI: 10.3390/bdcc9030059
Source: DOAJ, record doaj-art-bc1d06f057ff423ba1d67fdff3ec3687
Author Affiliations:
Pegah Salehi: SimulaMet, 0167 Oslo, Norway
Sajad Amouei Sheshkal: SimulaMet, 0167 Oslo, Norway
Vajira Thambawita: SimulaMet, 0167 Oslo, Norway
Sushant Gautam: SimulaMet, 0167 Oslo, Norway
Saeed S. Sabet: Forzasys AS, 0840 Oslo, Norway
Dag Johansen: Department of Computer Science, UiT The Arctic University of Norway, 9019 Tromsø, Norway
Michael A. Riegler: Simula Research Laboratory, 0164 Oslo, Norway
Pål Halvorsen: SimulaMet, 0167 Oslo, Norway
Keywords: talking portrait synthesis; interactive avatar; Whisper; neural radiance fields (NeRFs); child protective services (CPS)