Wav2Lip Bridges Communication Gap: Automating Lip Sync and Language Translation for Indian Languages

Translating speech in videos from one language to another is known as audio-visual translation (AVT). This paper describes the implementation of an automated AVT and lip-synced dubbing application. It addresses the difficulty of synchronizing mouth movements with translated speech by building...

Bibliographic Details
Main Authors: Vaishnavi Venkataraghavan, Shoba Sivapatham, Asutosh Kar
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Wav2Lip; automatic speech recognition (ASR); audio-visual translation (AVT); lip synchronization; Google Speech Recognition (GSR); Wav2vec 2.0
Online Access:https://ieeexplore.ieee.org/document/10971971/
collection DOAJ
description Translating speech in videos from one language to another is known as audio-visual translation (AVT). This paper describes the implementation of an automated AVT and lip-synced dubbing application. It addresses the difficulty of synchronizing mouth movements with translated speech by building a web application that synthesizes the speaker’s lip movements to match the translated audio. Using ASR models, speech from the source video is converted to text, translated into several languages, and then automatically synthesized into speech in the target language. A lip-synchronization model, Wav2Lip, is used to alter the mouth movements in the video to correspond to the target-language audio. We compare two well-known ASR systems: Wav2vec 2.0 and Google Speech Recognition. Wav2vec 2.0 performs better, with a lower average word error rate (WER) of 15.38%, and is used in our final web application. The performance of the video dubbing component is evaluated with generated speech in Tamil, Telugu, Hindi, and English, and we find that our generated videos outperform the existing ones. Our proposed AVT application is user-friendly for a wide variety of speakers, utilizing readily available TTS systems instead of training on an individual speaker’s voice.
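The abstract describes a four-stage pipeline: ASR transcribes the source audio, the transcript is machine-translated, a TTS engine synthesizes target-language speech, and Wav2Lip re-renders the mouth region to match the new audio. A minimal sketch of that orchestration is shown below; every stage function here is a hypothetical placeholder (the paper's actual system backs them with Wav2vec 2.0, a translation service, off-the-shelf TTS, and the Wav2Lip model), so only the data flow is illustrative.

```python
# Sketch of the AVT dubbing pipeline from the abstract. All stage
# functions are stand-ins: a real system would call Wav2vec 2.0 (ASR),
# a machine-translation API, a TTS engine, and Wav2Lip in their place.

def transcribe(audio_path: str) -> str:
    """ASR stage: source speech -> source-language text."""
    return f"text({audio_path})"

def translate(text: str, target_lang: str) -> str:
    """MT stage: source text -> target-language text."""
    return f"{target_lang}:{text}"

def synthesize(text: str) -> str:
    """TTS stage: target text -> target-language speech."""
    return f"speech({text})"

def lip_sync(video_path: str, audio: str) -> str:
    """Wav2Lip stage: re-render mouth movements to match new audio."""
    return f"dubbed({video_path},{audio})"

def dub(video_path: str, audio_path: str, target_lang: str) -> str:
    """Chain the four stages: ASR -> MT -> TTS -> lip sync."""
    text = transcribe(audio_path)
    translated = translate(text, target_lang)
    new_audio = synthesize(translated)
    return lip_sync(video_path, new_audio)

print(dub("clip.mp4", "clip.wav", "ta"))  # e.g. dubbing into Tamil ("ta")
```

The web application described in the paper would wrap a chain like this behind an upload form, with the target language (Tamil, Telugu, Hindi, or English) chosen by the user.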
id doaj-art-a656c873774f45eb8a96b02ecad422d9
institution OA Journals
issn 2169-3536
spelling Vaishnavi Venkataraghavan (School of Electronics Engineering, Vellore Institute of Technology, Chennai, India), Shoba Sivapatham (ORCID: 0000-0001-8036-2420; Centre for Advanced Data Science, Vellore Institute of Technology, Chennai, India), and Asutosh Kar (ORCID: 0000-0003-0011-0069; Department of Electronics and Communication Engineering, Dr. B. R. Ambedkar National Institute of Technology, Jalandhar, Punjab, India). “Wav2Lip Bridges Communication Gap: Automating Lip Sync and Language Translation for Indian Languages.” IEEE Access (ISSN 2169-3536), vol. 13, pp. 72906-72917, 2025-01-01. DOI: 10.1109/ACCESS.2025.3562883. https://ieeexplore.ieee.org/document/10971971/. Keywords: Wav2Lip; automatic speech recognition (ASR); audio-visual translation (AVT); lip synchronization; Google Speech Recognition (GSR); Wav2vec 2.0.
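The abstract reports an average WER of 15.38% for Wav2vec 2.0, the metric used to choose it over Google Speech Recognition. For reference, WER is the word-level Levenshtein distance between a reference transcript and an ASR hypothesis, divided by the reference length. The standard computation is sketched below; this is the conventional definition, not code taken from the paper.

```python
# Word error rate: (substitutions + deletions + insertions) / reference
# length, computed via dynamic programming over word sequences.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 15.38% means that, averaged over the test utterances, roughly one word in six or seven was substituted, dropped, or inserted relative to the reference transcript.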
topic Wav2Lip
automatic speech recognition (ASR)
audio-visual translation (AVT)
lip synchronization
Google Speech Recognition (GSR)
Wav2vec 2.0