Realistic Speech-Driven Talking Video Generation with Personalized Pose

Bibliographic Details
Main Authors: Xu Zhang, Liguo Weng
Affiliation: Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
Format: Article
Language: English
Published: Wiley, 2020-01-01
Series: Complexity
ISSN: 1076-2787, 1099-0526
Online Access: http://dx.doi.org/10.1155/2020/6629634
Description: In this work, we propose a method that transforms a speaker's speech into a talking video of a target character, making mouth-shape synchronization, expression, and body posture in the synthesized video more realistic. The task is challenging because changes in mouth shape and posture are coupled with the semantic content of the audio: model training is difficult to converge, results are unstable in complex scenes, and existing speech-driven talking-head methods do not solve this problem well. The proposed method first generates the sequence of key points of the speaker's face and body posture from the audio signal in real time, then visualizes these key points as a series of two-dimensional skeleton images, and finally produces the speaker video through a video generation network. We randomly sample audio clips, encode audio content and temporal correlations with a more effective network structure, and optimize the network outputs with a differential loss and a pose perception loss, obtaining a smoother pose key-point sequence and better performance. In addition, inserting specified action frames into the synthesized pose-sequence window enriches the action poses of the synthesized speaker, making the result more realistic and natural. To generate realistic, high-resolution pose detail, we insert a local attention mechanism into the keypoint network and direct attention to the local details of the character through spatial weight masks. To verify the effectiveness of the proposed method, we used the objective evaluation index NME together with subjective user evaluation. Experiments showed that our method vividly turns audio content into corresponding speaker videos, with lip-matching accuracy and expression and posture quality better than previous work; it outperformed existing methods on both the NME index and in subjective user evaluation.
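
A minimal sketch of the two-stage pipeline the description outlines: audio features are encoded into a per-frame keypoint sequence (stage 1), which is then rasterized into skeleton images for a video generation network (stage 2). This assumes a PyTorch setting; the module names, layer sizes, audio feature dimension, and keypoint count are illustrative placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn

class Audio2Keypoints(nn.Module):
    """Stage 1: map a window of audio features (e.g., MFCCs) to a
    sequence of 2D face/body keypoints."""
    def __init__(self, audio_dim=28, hidden_dim=256, num_keypoints=137):
        super().__init__()
        # Temporal encoder over the audio feature sequence.
        self.encoder = nn.GRU(audio_dim, hidden_dim, num_layers=2,
                              batch_first=True)
        # Per-frame regression head: (x, y) for every keypoint.
        self.head = nn.Linear(hidden_dim, num_keypoints * 2)

    def forward(self, audio_feats):             # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)        # (B, T, hidden_dim)
        kps = self.head(h)                      # (B, T, K*2)
        return kps.view(*kps.shape[:2], -1, 2)  # (B, T, K, 2)

# Stage 2 (not shown): each keypoint frame is drawn as a 2D skeleton
# image and passed to an image-to-image video generation network; a
# pix2pix-style generator is a common choice for this rendering step.
```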
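
A hedged sketch of the keypoint training objective. The description mentions a differential loss and a pose perception loss without formulas; one plausible reading, shown below, is a frame-to-frame motion term plus a perceptual-style distance in the feature space of a frozen pose network. The weights `w_diff` and `w_pose` and the `pose_encoder` argument are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def keypoint_losses(pred, target, pose_encoder=None,
                    w_diff=1.0, w_pose=0.1):
    """pred, target: (B, T, K, 2) keypoint sequences."""
    # Base regression term on absolute keypoint positions.
    loss = F.l1_loss(pred, target)
    # Differential loss: match frame-to-frame motion, penalizing
    # jitter and encouraging a smoother pose key-point sequence.
    loss = loss + w_diff * F.l1_loss(pred[:, 1:] - pred[:, :-1],
                                     target[:, 1:] - target[:, :-1])
    # Pose perception loss: feature-space distance under a frozen
    # pose network (the encoder itself is an assumed component).
    if pose_encoder is not None:
        with torch.no_grad():
            target_feat = pose_encoder(target)
        loss = loss + w_pose * F.l1_loss(pose_encoder(pred), target_feat)
    return loss
```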
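
For reference, a sketch of the NME (normalized mean error) index used in the objective evaluation: the mean Euclidean distance between predicted and ground-truth keypoints, divided by a normalization factor. The record does not state which normalizer the paper uses, so it is passed in explicitly here (inter-ocular distance and bounding-box size are common choices).

```python
import numpy as np

def nme(pred, target, norm_factor):
    """pred, target: (K, 2) arrays of 2D keypoints for one frame.
    norm_factor: scalar normalizer, e.g., inter-ocular distance."""
    errors = np.linalg.norm(pred - target, axis=1)  # per-keypoint L2
    return float(errors.mean()) / norm_factor
```

A lower NME means the generated keypoints track the ground truth more closely, which is how lip and pose accuracy is compared against prior work.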