Audio2DiffuGesture: Generating a diverse co-speech gesture based on a diffusion model

People use a combination of language and gestures to convey intentions, making the generation of natural co-speech gestures a challenging task. In audio-driven gesture generation, relying solely on features extracted from raw audio waveforms limits the model's ability to fully learn the joint distribution between audio and gestures. To address this limitation, we integrated key features from both raw audio waveforms and Mel-spectrograms. Specifically, we employed cascaded 1D convolutions to extract features from the audio waveform and a two-stage attention mechanism to capture features from the Mel-spectrogram. The fused features were then input into a Transformer with cross-dimension attention for sequence modeling, which mitigated accumulated non-autoregressive errors and reduced redundant information. We developed a diffusion model-based Audio to Diffusion Gesture (A2DG) generation pipeline capable of producing high-quality and diverse gestures. Our method demonstrated superior performance in extensive experiments compared to established baselines. Regarding the TED Gesture and TED Expressive datasets, the Fréchet Gesture Distance (FGD) performance improved by 16.8 and 56%, respectively. Additionally, a user study validated that the co-speech gestures generated by our method are more vivid and realistic.

(An illustrative code sketch of the audio feature pipeline described above follows the record below.)

Bibliographic Details
Main Authors: Hongze Yao, Yingting Xu, Weitao WU, Huabin He, Wen Ren, Zhiming Cai
Format: Article
Language: English
Published: AIMS Press 2024-09-01
Series: Electronic Research Archive
Subjects: co-speech gesture, cross-modal, human-computer interaction, diffusion model, attention mechanism
Online Access: https://www.aimspress.com/article/doi/10.3934/era.2024250
author Hongze Yao
Yingting Xu
Weitao WU
Huabin He
Wen Ren
Zhiming Cai
collection DOAJ
description People use a combination of language and gestures to convey intentions, making the generation of natural co-speech gestures a challenging task. In audio-driven gesture generation, relying solely on features extracted from raw audio waveforms limits the model's ability to fully learn the joint distribution between audio and gestures. To address this limitation, we integrated key features from both raw audio waveforms and Mel-spectrograms. Specifically, we employed cascaded 1D convolutions to extract features from the audio waveform and a two-stage attention mechanism to capture features from the Mel-spectrogram. The fused features were then input into a Transformer with cross-dimension attention for sequence modeling, which mitigated accumulated non-autoregressive errors and reduced redundant information. We developed a diffusion model-based Audio to Diffusion Gesture (A2DG) generation pipeline capable of producing high-quality and diverse gestures. Our method demonstrated superior performance in extensive experiments compared to established baselines. Regarding the TED Gesture and TED Expressive datasets, the Fréchet Gesture Distance (FGD) performance improved by 16.8 and 56%, respectively. Additionally, a user study validated that the co-speech gestures generated by our method are more vivid and realistic.
format Article
id doaj-art-f383565bac7a4a909aecdc91ec948ada
institution Kabale University
issn 2688-1594
language English
publishDate 2024-09-01
publisher AIMS Press
record_format Article
series Electronic Research Archive
spelling Electronic Research Archive, vol. 32, no. 9, pp. 5392-5408, 2024-09-01. doi: 10.3934/era.2024250. Author affiliations: Hongze Yao, Yingting Xu, Weitao WU, Huabin He, Zhiming Cai (School of Electronics, Electrical Engineering and Physics, Fujian University of Technology, Fuzhou 350118, China); Wen Ren (School of Mechanical and Electric Engineering, Sanming University, Sanming 365004, China).
title Audio2DiffuGesture: Generating a diverse co-speech gesture based on a diffusion model
topic co-speech gesture
cross-modal
human-computer interaction
diffusion model
attention mechanism
url https://www.aimspress.com/article/doi/10.3934/era.2024250
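Illustrative sketch (PyTorch). As a rough illustration of the audio feature pipeline the abstract describes, the code below shows a dual-branch audio encoder: cascaded 1D convolutions over the raw waveform, a two-stage self-attention block over the Mel-spectrogram, and a fusion step whose output is modeled by a Transformer encoder. This is a minimal sketch under stated assumptions, not the authors' A2DG implementation: all layer counts, dimensions, and class names (WaveformEncoder, MelEncoder, AudioFusionEncoder) are hypothetical, and a standard TransformerEncoder stands in for the paper's cross-dimension attention.

import torch
import torch.nn as nn
import torch.nn.functional as F


class WaveformEncoder(nn.Module):
    # Cascaded 1D convolutions over the raw audio waveform (layer sizes are assumptions).
    def __init__(self, out_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=15, stride=4, padding=7), nn.GELU(),
            nn.Conv1d(32, 64, kernel_size=15, stride=4, padding=7), nn.GELU(),
            nn.Conv1d(64, out_dim, kernel_size=15, stride=4, padding=7), nn.GELU(),
        )

    def forward(self, wav):                    # wav: (batch, samples)
        feats = self.convs(wav.unsqueeze(1))   # (batch, out_dim, frames_w)
        return feats.transpose(1, 2)           # (batch, frames_w, out_dim)


class MelEncoder(nn.Module):
    # Two cascaded self-attention stages over Mel-spectrogram frames
    # (a stand-in for the paper's two-stage attention mechanism).
    def __init__(self, n_mels=80, dim=128, heads=4):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mel):                    # mel: (batch, frames_m, n_mels)
        x = self.proj(mel)
        x, _ = self.attn1(x, x, x)             # stage 1
        x, _ = self.attn2(x, x, x)             # stage 2
        return x                               # (batch, frames_m, dim)


class AudioFusionEncoder(nn.Module):
    # Resamples both feature streams to a common length, fuses them, and models the
    # fused sequence with a Transformer encoder; the result could serve as the
    # conditioning signal for a diffusion-based gesture generator.
    def __init__(self, dim=128, frames=34):
        super().__init__()
        self.wave_enc = WaveformEncoder(out_dim=dim)
        self.mel_enc = MelEncoder(dim=dim)
        self.frames = frames
        self.fuse = nn.Linear(2 * dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def _resample(self, x):                    # (batch, T, C) -> (batch, frames, C)
        x = F.interpolate(x.transpose(1, 2), size=self.frames, mode="linear",
                          align_corners=False)
        return x.transpose(1, 2)

    def forward(self, wav, mel):
        w = self._resample(self.wave_enc(wav))
        m = self._resample(self.mel_enc(mel))
        fused = self.fuse(torch.cat([w, m], dim=-1))   # (batch, frames, dim)
        return self.transformer(fused)


# Example usage with random tensors standing in for real audio.
enc = AudioFusionEncoder()
wav = torch.randn(2, 16000)       # about one second of 16 kHz audio
mel = torch.randn(2, 100, 80)     # 100 Mel frames with 80 bins each
cond = enc(wav, mel)              # -> (2, 34, 128) conditioning sequence

In the paper, the fused audio representation conditions a diffusion model that produces the gesture sequence; that denoising stage is omitted here to keep the sketch short.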