A Novel Sentence-Level Visual Speech Recognition System for Vietnamese Language Using ResNet3D and Zipformer

Bibliographic Details
Main Authors: Phat Nguyen Huu, Thach Ho Sy
Format: Article
Language: English
Published: Wiley 2025-01-01
Series: Modelling and Simulation in Engineering
Online Access: http://dx.doi.org/10.1155/mse/2087573
collection DOAJ
description This paper presents the first sentence-level visual speech recognition (VSR) system specifically designed for the Vietnamese language. We have developed a unique dataset comprising 115 h of video recordings from over 100 speakers, focusing on single-speaker scenarios. The proposed VSR system utilizes a ResNet3D architecture as the visual frontend, paired with a neural transducer framework featuring a Zipformer speech encoder. It incorporates a stateless decoder that considers two preceding tokens and is optimized with a pruned-RNNT loss function. Experimental results show that our system achieves a word error rate (WER) of 27.14% and a character error rate (CER) of 20.45% on single-speaker tasks, demonstrating significant progress in VSR for Vietnamese.
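The two metrics quoted in the description, WER and CER, are both edit-distance ratios. The following is a minimal sketch of how such scores are commonly computed, not the authors' evaluation code. The function names, the NFC normalization step, the choice to count spaces as characters for CER, and the sample sentences are all assumptions made for illustration.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits divided by total reference length.

    unit="word" gives WER, unit="char" gives CER.
    Assumption: for CER, spaces are counted as characters.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        # Assumption: NFC normalization so that Vietnamese diacritics
        # compare as single code points.
        ref = unicodedata.normalize("NFC", ref)
        hyp = unicodedata.normalize("NFC", hyp)
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return total_edits / total_len


# Hypothetical reference/hypothesis pair, not taken from the paper's dataset.
refs = ["toi di hoc"]
hyps = ["toi di lam"]
wer = error_rate(refs, hyps, unit="word")  # 1 wrong word out of 3
cer = error_rate(refs, hyps, unit="char")  # 3 wrong characters out of 10
print(f"WER = {wer:.2%}, CER = {cer:.2%}")
```

Both scores are summed over the whole test set before dividing, so long sentences weigh more than short ones. Averaging per-sentence rates is a different convention and gives different numbers.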
issn 1687-5605
author_affiliation Phat Nguyen Huu: School of Electronic and Electrical Engineering
Thach Ho Sy: School of Electronic and Electrical Engineering