Wav2Lip Bridges Communication Gap: Automating Lip Sync and Language Translation for Indian Languages
Translating spoken speech in videos from one language to another is known as audio-visual translation (AVT). This paper describes the implementation of an automated AVT and lip-synced dubbing application. It addresses the difficulty of synchronizing mouth movements with translated speech by building...
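
The record's full description lists four stages: speech recognition, text translation, speech synthesis, and Wav2Lip lip synchronization. The sketch below shows only how such stages could be chained. The class name `DubbingPipeline`, its field names, and the callable signatures are hypothetical illustrations, not the authors' code or any library's API; each stage is injected so that a real ASR, translation, TTS, or lip-sync backend could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DubbingPipeline:
    """Hypothetical four-stage audio-visual translation pipeline."""

    # video path -> source-language transcript (ASR stage)
    transcribe: Callable[[str], str]
    # (transcript, target language code) -> translated text
    translate: Callable[[str, str], str]
    # (translated text, target language code) -> path of synthesized audio
    synthesize: Callable[[str, str], str]
    # (video path, audio path) -> path of the lip-synced video
    lip_sync: Callable[[str, str], str]

    def run(self, video_path: str, target_lang: str) -> str:
        """Run the stages in order and return the dubbed video path."""
        text = self.transcribe(video_path)
        translated = self.translate(text, target_lang)
        audio_path = self.synthesize(translated, target_lang)
        return self.lip_sync(video_path, audio_path)
```

Keeping each stage behind a plain callable means one ASR system can be swapped for another without touching the rest of the chain, which is the kind of comparison the description reports.
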
| Main Authors: | Vaishnavi Venkataraghavan, Shoba Sivapatham, Asutosh Kar |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
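| Subjects: | Wav2Lip; automatic speech recognition (ASR); audio-visual translation (AVT); lip synchronization; Google speech recognition (GSR); Wav2vec 2.0 |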
| Online Access: | https://ieeexplore.ieee.org/document/10971971/ |
| author | Vaishnavi Venkataraghavan, Shoba Sivapatham, Asutosh Kar |
| collection | DOAJ |
| description | Translating spoken speech in videos from one language to another is known as audio-visual translation (AVT). This paper describes the implementation of an automated AVT and lip-synced dubbing application. It addresses the difficulty of synchronizing mouth movements with translated speech by building a web application that synthesizes the speaker’s lip movements to match translated audio. Using ASR models, the speech from the source video is converted to text, translated into several languages, and then automatically synthesized into speech in the target language. A lip synchronization model, Wav2Lip, is used to alter the mouth movements in the video to correspond to the target language. We compare two well-known ASR systems: Wav2vec 2.0 and Google Speech Recognition. Wav2vec 2.0 achieves the lower average WER, 15.38%, and is used in our final web application. The performance of the video dubbing component is discussed for generated speech in Tamil, Telugu, Hindi, and English, and we find that our generated videos outperform existing ones. Our proposed AVT application is user-friendly for a wide variety of speakers because it uses readily available TTS systems instead of training on an individual speaker’s voice. |
| id | doaj-art-a656c873774f45eb8a96b02ecad422d9 |
| institution | OA Journals |
| issn | 2169-3536 |
| spelling | doaj-art-a656c873774f45eb8a96b02ecad422d9; 2025-08-20T01:48:12Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 72906-72917; DOI 10.1109/ACCESS.2025.3562883; IEEE document 10971971; Vaishnavi Venkataraghavan (School of Electronics Engineering, Vellore Institute of Technology, Chennai, India); Shoba Sivapatham (https://orcid.org/0000-0001-8036-2420; Centre for Advanced Data Science, Vellore Institute of Technology, Chennai, India); Asutosh Kar (https://orcid.org/0000-0003-0011-0069; Department of Electronics and Communication Engineering, Dr. B. R. Ambedkar National Institute of Technology, Jalandhar, Punjab, India) |
| title | Wav2Lip Bridges Communication Gap: Automating Lip Sync and Language Translation for Indian Languages |
| topic | Wav2Lip; automatic speech recognition (ASR); audio-visual translation (AVT); lip synchronization; Google speech recognition (GSR); Wav2vec 2.0 |
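
The description compares ASR systems by average word error rate (WER). As a minimal sketch of that metric, the code below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names are illustrative, and averaging per-utterance WER is an assumption: the record does not say how the reported 15.38% average was computed, nor which text normalization was applied before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def average_wer_percent(pairs):
    """Mean per-utterance WER, in percent (assumed averaging scheme)."""
    rates = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    return 100.0 * sum(rates) / len(rates)
```

Because WER divides by the reference length, a hypothesis with many inserted words can score above 100%, so it is an error rate rather than a bounded accuracy measure.
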