A Novel Sentence-Level Visual Speech Recognition System for Vietnamese Language Using ResNet3D and Zipformer
This paper presents the first sentence-level visual speech recognition (VSR) system specifically designed for the Vietnamese language. We have developed a unique dataset comprising 115 h of video recordings from over 100 speakers, focusing on single-speaker scenarios. The proposed VSR system utilize...
| Main Authors: | Phat Nguyen Huu, Thach Ho Sy |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Wiley, 2025-01-01 |
| Series: | Modelling and Simulation in Engineering |
| Online Access: | http://dx.doi.org/10.1155/mse/2087573 |
| _version_ | 1850123730433343488 |
|---|---|
| author | Phat Nguyen Huu; Thach Ho Sy |
| collection | DOAJ |
| description | This paper presents the first sentence-level visual speech recognition (VSR) system specifically designed for the Vietnamese language. We have developed a unique dataset comprising 115 h of video recordings from over 100 speakers, focusing on single-speaker scenarios. The proposed VSR system utilizes a ResNet3D architecture as the visual frontend, paired with a neural transducer framework featuring a Zipformer speech encoder. It incorporates a stateless decoder that considers two preceding tokens and is optimized with a pruned-RNNT loss function. Experimental results show that our system achieves a word error rate (WER) of 27.14% and a character error rate (CER) of 20.45% on single-speaker tasks, demonstrating significant progress in VSR for Vietnamese. |
| format | Article |
| id | doaj-art-d93de22e364747cba55c1a6bf963ebb6 |
| institution | OA Journals |
| issn | 1687-5605 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Wiley |
| record_format | Article |
| series | Modelling and Simulation in Engineering |
| title | A Novel Sentence-Level Visual Speech Recognition System for Vietnamese Language Using ResNet3D and Zipformer |
| url | http://dx.doi.org/10.1155/mse/2087573 |
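
The description above reports a word error rate (WER) and a character error rate (CER). As a minimal sketch of how such rates are commonly computed, the Python below uses the standard Levenshtein (edit-distance) definition. The record does not say which scoring tool the authors used, so the function names `edit_distance`, `wer`, and `cer` are hypothetical, and the choices to apply NFC normalization and to count spaces as characters in CER are assumptions made here for illustration.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and deletions
    needed to turn `hyp` into `ref`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _normalize(text):
    # Assumption: compose Vietnamese diacritics (NFC) so that each accented
    # letter counts as a single character.
    return unicodedata.normalize("NFC", text).strip()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = _normalize(reference).split()
    hyp_words = _normalize(hypothesis).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.

    Assumption: spaces are counted as characters.
    """
    ref_chars = list(_normalize(reference))
    hyp_chars = list(_normalize(hypothesis))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    reference = "xin chào các bạn"
    hypothesis = "xin chao các bạn"  # one tone mark missed
    print(f"WER = {wer(reference, hypothesis):.4f}")  # 1 of 4 words wrong
    print(f"CER = {cer(reference, hypothesis):.4f}")  # 1 of 16 characters wrong
```

In this illustrative pair, a single missed tone mark makes one of four words wrong (WER 0.25) but only one of sixteen characters wrong (CER 0.0625). That is one reason CER tends to come out lower than WER for a tonal, diacritic-heavy language such as Vietnamese.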