Speech Intelligibility in Virtual Avatars: Comparison Between Audio and Audio–Visual-Driven Facial Animation

Bibliographic Details
Main Authors: Federico Cioffi, Massimiliano Masullo, Aniello Pascale, Luigi Maffei
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Acoustics
Subjects: virtual reality; avatar; facial animation; Unreal Engine; MetaHuman; speech intelligibility
Online Access: https://www.mdpi.com/2624-599X/7/2/30
Collection: DOAJ
Description: Speech intelligibility (SI) is critical to effective communication across various settings, although it is often compromised by adverse acoustic conditions. In noisy environments, visual cues such as lip movements and facial expressions, when congruent with auditory information, can significantly enhance speech perception and reduce cognitive effort. With the ever-growing diffusion of virtual environments, communicating through virtual avatars is becoming increasingly prevalent, thus requiring a comprehensive understanding of these dynamics to ensure effective interactions. The present study used Unreal Engine’s MetaHuman technology to compare four methodologies for creating facial animation: MetaHuman Animator (MHA), MetaHuman LiveLink (MHLL), Audio-Driven MetaHuman (ADMH), and Synthetized Audio-Driven MetaHuman (SADMH). Thirty-six word pairs from the Diagnostic Rhyme Test (DRT) were used as input stimuli to create the animations and to compare them in terms of intelligibility. Moreover, to simulate challenging background noise, the animations were mixed with babble noise at a signal-to-noise ratio of −13 dB (A). Participants assessed a total of 144 facial animations. Results showed the ADMH condition to be the most intelligible among the methodologies used, probably due to enhanced clarity and consistency in the generated facial animations, while eliminating distractions such as micro-expressions and natural variations in human articulation.
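The abstract describes mixing speech stimuli with babble noise at a signal-to-noise ratio of −13 dB (A). As a rough illustration of the underlying mixing step (a broadband, unweighted SNR; the A-weighting in the paper applies to the level measurement and is omitted here), the noise can be scaled so the speech-to-noise power ratio hits the target before summing. The function name and signal shapes below are hypothetical, not taken from the paper:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`, then mix."""
    # Tile/truncate the noise to match the speech duration.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Target: p_speech / (scale^2 * p_noise) = 10 ** (snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

A negative SNR such as −13 dB means the babble noise carries roughly twenty times the power of the speech, which is what makes the visual cues from the facial animation informative.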
ISSN: 2624-599X
DOI: 10.3390/acoustics7020030
Author Affiliations:
Federico Cioffi: Department of Architecture and Industrial Design, Università degli Studi della Campania “Luigi Vanvitelli”, 81031 Aversa, CE, Italy
Massimiliano Masullo: Department of Architecture and Industrial Design, Università degli Studi della Campania “Luigi Vanvitelli”, 81031 Aversa, CE, Italy
Aniello Pascale: Immensive s.r.l.s., 81030 Parete, CE, Italy
Luigi Maffei: Department of Architecture and Industrial Design, Università degli Studi della Campania “Luigi Vanvitelli”, 81031 Aversa, CE, Italy