Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis

Bibliographic Details
Main Authors: Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler, Pål Halvorsen
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Big Data and Cognitive Computing
Online Access: https://www.mdpi.com/2504-2289/9/3/59
Description
Summary: This paper explores advancements in real-time talking-head generation, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI’s Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. Although interviewer training systems are considered a potential application, the primary contribution of this work is the improvement of the technical foundations necessary for creating responsive AI avatars. These advancements enable more immersive interactions and expand the scope of AI-driven applications, including educational tools and simulated training environments.
ISSN: 2504-2289