STDNet: Improved lip reading via short-term temporal dependency modeling

Bibliographic Details
Main Authors: Xiaoer Wu, Zhenhua Tan, Ziwei Cheng, Yuran Ru
Format: Article
Language: English
Published: KeAi Communications Co., Ltd. 2025-04-01
Series: Virtual Reality & Intelligent Hardware
Subjects: Lip reading; Spatio-temporal feature fusion; Short-term temporal dependency modeling
Online Access: http://www.sciencedirect.com/science/article/pii/S209657962400038X
_version_ 1850183020281069568
author Xiaoer Wu
Zhenhua Tan
Ziwei Cheng
Yuran Ru
author_facet Xiaoer Wu
Zhenhua Tan
Ziwei Cheng
Yuran Ru
author_sort Xiaoer Wu
collection DOAJ
description Background: Lip reading uses lip images for visual speech recognition. Deep-learning-based lip reading has greatly improved performance on current datasets; however, most existing research ignores the significance of short-term temporal dependencies of lip-shape variations between adjacent frames, which leaves room for further improvement in feature extraction. Methods: This article presents a spatiotemporal feature fusion network (STDNet) that compensates for the deficiencies of current lip-reading approaches in short-term temporal dependency modeling. Specifically, to distinguish more similar and intricate content, STDNet adds a temporal feature extraction branch based on a 3D-CNN, which enhances the learning of dynamic lip movements in adjacent frames without affecting spatial feature extraction. In particular, we designed a local-temporal block, which aggregates interframe differences, strengthening the relationship between various local lip regions through multiscale convolution. We incorporated the squeeze-and-excitation mechanism into the global-temporal block, which processes a single frame as an independent unit to learn temporal variations across the entire lip region more effectively. Furthermore, attention pooling was introduced to highlight meaningful frames containing key semantic information for the target word. Results: Experimental results demonstrated STDNet's superior performance on the LRW and LRW-1000 datasets, achieving word-level recognition accuracies of 90.2% and 53.56%, respectively. Extensive ablation experiments verified the rationality and effectiveness of its modules. Conclusions: The proposed model effectively addresses short-term temporal dependency limitations in lip reading and improves the temporal robustness of the model against variable-length sequences. These advancements validate the importance of explicit short-term dynamics modeling for practical lip-reading systems.
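Illustrative note: the description mentions two ideas that a small example can make concrete, namely aggregating interframe differences and attention pooling over frames. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names, the use of flat per-frame feature vectors, and the linear scoring vector are all hypothetical, and the 3D-CNN, multiscale convolution, and squeeze-and-excitation parts of STDNet are not reproduced here.

```python
import math


def interframe_differences(frames):
    """Return differences between adjacent frames.

    frames: list of T feature vectors (lists of equal length).
    Result: list of T-1 vectors, each frame[t+1] - frame[t].
    """
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, score_vector):
    """Pool T frame features into one vector using attention weights.

    score_vector is a hypothetical stand-in for learned parameters:
    each frame's score is its dot product with this vector, and the
    softmax of the scores gives the per-frame weights.
    """
    scores = [
        sum(f * w for f, w in zip(frame, score_vector))
        for frame in features
    ]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [
        sum(wt * frame[d] for wt, frame in zip(weights, features))
        for d in range(dim)
    ]
    return pooled, weights
```

With an all-zero scoring vector every frame receives the same weight, so the pooling reduces to a plain temporal average; a non-zero vector shifts weight toward frames whose features score higher, which is the sense in which attention pooling can highlight particular frames.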
format Article
id doaj-art-7de425792d7a4c7cad7ea6fe3d5f3337
institution OA Journals
issn 2096-5796
language English
publishDate 2025-04-01
publisher KeAi Communications Co., Ltd.
record_format Article
series Virtual Reality & Intelligent Hardware
spelling doaj-art-7de425792d7a4c7cad7ea6fe3d5f3337
2025-08-20T02:17:28Z
eng
KeAi Communications Co., Ltd.
Virtual Reality & Intelligent Hardware
2096-5796
2025-04-01
7 (2): 173-187
10.1016/j.vrih.2024.07.003
STDNet: Improved lip reading via short-term temporal dependency modeling
Xiaoer Wu (Software College, Northeastern University, Shenyang 110819, China)
Zhenhua Tan (Faculty of Software College, Northeastern University, Shenyang 110819, China; corresponding author)
Ziwei Cheng (Software College, Northeastern University, Shenyang 110819, China)
Yuran Ru (Software College, Northeastern University, Shenyang 110819, China)
http://www.sciencedirect.com/science/article/pii/S209657962400038X
Lip reading
Spatio-temporal feature fusion
Short-term temporal dependency modeling
spellingShingle Xiaoer Wu
Zhenhua Tan
Ziwei Cheng
Yuran Ru
STDNet: Improved lip reading via short-term temporal dependency modeling
Virtual Reality & Intelligent Hardware
Lip reading
Spatio-temporal feature fusion
Short-term temporal dependency modeling
title STDNet: Improved lip reading via short-term temporal dependency modeling
title_full STDNet: Improved lip reading via short-term temporal dependency modeling
title_fullStr STDNet: Improved lip reading via short-term temporal dependency modeling
title_full_unstemmed STDNet: Improved lip reading via short-term temporal dependency modeling
title_short STDNet: Improved lip reading via short-term temporal dependency modeling
title_sort stdnet improved lip reading via short term temporal dependency modeling
topic Lip reading
Spatio-temporal feature fusion
Short-term temporal dependency modeling
url http://www.sciencedirect.com/science/article/pii/S209657962400038X
work_keys_str_mv AT xiaoerwu stdnetimprovedlipreadingviashorttermtemporaldependencymodeling
AT zhenhuatan stdnetimprovedlipreadingviashorttermtemporaldependencymodeling
AT ziweicheng stdnetimprovedlipreadingviashorttermtemporaldependencymodeling
AT yuranru stdnetimprovedlipreadingviashorttermtemporaldependencymodeling