Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis
This paper explores advancements in real-time talking-head generation, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system...
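| Main Authors: | Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler, Pål Halvorsen |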
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Big Data and Cognitive Computing |
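| Subjects: | talking portrait synthesis; interactive avatar; Whisper; neural radiance fields (NeRFs); child protective services (CPS) |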
| Online Access: | https://www.mdpi.com/2504-2289/9/3/59 |
| _version_ | 1850089799918026752 |
|---|---|
| author | Pegah Salehi; Sajad Amouei Sheshkal; Vajira Thambawita; Sushant Gautam; Saeed S. Sabet; Dag Johansen; Michael A. Riegler; Pål Halvorsen |
| author_sort | Pegah Salehi |
| collection | DOAJ |
| description | This paper explores advancements in real-time talking-head generation, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI’s Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. Although interviewer training systems are considered a potential application, the primary contribution of this work is the improvement of the technical foundations necessary for creating responsive AI avatars. These advancements enable more immersive interactions and expand the scope of AI-driven applications, including educational tools and simulated training environments. |
| format | Article |
| id | doaj-art-bc1d06f057ff423ba1d67fdff3ec3687 |
| institution | DOAJ |
| issn | 2504-2289 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Big Data and Cognitive Computing |
| spelling | doaj-art-bc1d06f057ff423ba1d67fdff3ec3687; 2025-08-20T02:42:41Z; eng; MDPI AG; Big Data and Cognitive Computing; ISSN 2504-2289; 2025-03-01; vol. 9, no. 3, art. 59; DOI 10.3390/bdcc9030059; Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis; authors and affiliations: Pegah Salehi (SimulaMet, 0167 Oslo, Norway); Sajad Amouei Sheshkal (SimulaMet, 0167 Oslo, Norway); Vajira Thambawita (SimulaMet, 0167 Oslo, Norway); Sushant Gautam (SimulaMet, 0167 Oslo, Norway); Saeed S. Sabet (Forzasys AS, 0840 Oslo, Norway); Dag Johansen (Department of Computer Science, UiT The Arctic University of Norway, 9019 Tromsø, Norway); Michael A. Riegler (Simula Research Laboratory, 0164 Oslo, Norway); Pål Halvorsen (SimulaMet, 0167 Oslo, Norway); https://www.mdpi.com/2504-2289/9/3/59; keywords: talking portrait synthesis; interactive avatar; Whisper; neural radiance fields (NeRFs); child protective services (CPS) |
| title | Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis |
| topic | talking portrait synthesis; interactive avatar; Whisper; neural radiance fields (NeRFs); child protective services (CPS) |
| url | https://www.mdpi.com/2504-2289/9/3/59 |