A Novel Sentence-Level Visual Speech Recognition System for Vietnamese Language Using ResNet3D and Zipformer


Bibliographic Details
Main Authors:	Phat Nguyen Huu, Thach Ho Sy
Format:	Article
Language:	English
Published:	Wiley 2025-01-01
Series:	Modelling and Simulation in Engineering
Online Access:	http://dx.doi.org/10.1155/mse/2087573

collection	DOAJ
description	This paper presents the first sentence-level visual speech recognition (VSR) system specifically designed for the Vietnamese language. We have developed a unique dataset comprising 115 h of video recordings from over 100 speakers, focusing on single-speaker scenarios. The proposed VSR system utilizes a ResNet3D architecture as the visual frontend, paired with a neural transducer framework featuring a Zipformer speech encoder. It incorporates a stateless decoder that considers two preceding tokens and is optimized with a pruned-RNNT loss function. Experimental results show that our system achieves a word error rate (WER) of 27.14% and a character error rate (CER) of 20.45% on single-speaker tasks, demonstrating significant progress in VSR for Vietnamese.
format	Article
id	doaj-art-d93de22e364747cba55c1a6bf963ebb6
institution	OA Journals
issn	1687-5605
language	English
publishDate	2025-01-01
publisher	Wiley
record_format	Article
series	Modelling and Simulation in Engineering
spelling	doaj-art-d93de22e364747cba55c1a6bf963ebb6 | 2025-08-20T02:34:32Z | eng | Wiley | Modelling and Simulation in Engineering | 1687-5605 | 2025-01-01 | 2025 | 10.1155/mse/2087573 | A Novel Sentence-Level Visual Speech Recognition System for Vietnamese Language Using ResNet3D and Zipformer | Phat Nguyen Huu (School of Electronic and Electrical Engineering) | Thach Ho Sy (School of Electronic and Electrical Engineering) | http://dx.doi.org/10.1155/mse/2087573
title	A Novel Sentence-Level Visual Speech Recognition System for Vietnamese Language Using ResNet3D and Zipformer
url	http://dx.doi.org/10.1155/mse/2087573


Similar Items