Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement
Continuous Sign Language Recognition (CSLR) seeks to interpret the gestures used by deaf and hard-of-hearing individuals and translate them into natural language, thereby enhancing communication and interaction. A successful CSLR method relies on continuously tracking the presenter's gestures and facial movements.
Main Authors: | Zhen Wang, Dongyuan Li, Renhe Jiang, Manabu Okumura |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Sign language recognition, computer vision, video analysis, spatial-temporal datasets |
Online Access: | https://ieeexplore.ieee.org/document/10829616/ |
_version_ | 1841550761539928064 |
---|---|
author | Zhen Wang Dongyuan Li Renhe Jiang Manabu Okumura |
author_facet | Zhen Wang Dongyuan Li Renhe Jiang Manabu Okumura |
author_sort | Zhen Wang |
collection | DOAJ |
description | Continuous Sign Language Recognition (CSLR) seeks to interpret the gestures used by deaf and hard-of-hearing individuals and translate them into natural language, thereby enhancing communication and interaction. A successful CSLR method relies on continuously tracking the presenter's gestures and facial movements. Existing CSLR methods struggle to fully leverage fine-grained continuous frame information and often overlook the importance of multi-scale feature integration during decoding. To address these issues, this paper proposes a spatial-temporal feature-enhanced network, called STNet, for the CSLR task. First, to better exploit continuous frame information, we propose a spatial resonance module based on the optimal transport algorithm, which extracts the global common spatial features of adjacent frames along the frame sequence. Second, we design a frame-wise loss to preserve and enhance the features specific to each frame. Finally, to emphasize multi-scale feature fusion, we design a multi-temporal perception module on the decoder side, which allows each frame to attend to a larger range of other frames and enhances information interaction across scales. Extensive experiments on three benchmark datasets, PHOENIX14, PHOENIX14-T, and CSL-Daily, demonstrate that STNet consistently outperforms state-of-the-art methods, with a notable improvement of 2.9% in CSLR performance, showcasing its effectiveness and generalizability. Our approach provides a robust foundation for real-world applications such as sign language education and communication tools, while ablation and case studies highlight the impact of each module, paving the way for future research in CSLR. |
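The abstract describes STNet's three components only at a high level, and no code accompanies this record. The sketch below is therefore one plausible reading of the spatial resonance idea, not the authors' implementation: entropy-regularized optimal transport (log-domain Sinkhorn iterations) softly matches the spatial positions of two adjacent frames and keeps the features they share. Every name, shape, and hyper-parameter here (`sinkhorn`, `spatial_resonance`, `eps`, `iters`) is an assumption made for illustration.

```python
# Hypothetical sketch, NOT the authors' code: entropy-regularized optimal
# transport (log-domain Sinkhorn) used to align the spatial features of two
# adjacent video frames -- one plausible reading of a "spatial resonance" step.
import math

import torch
import torch.nn.functional as F


def sinkhorn(cost: torch.Tensor, eps: float = 0.05, iters: int = 20) -> torch.Tensor:
    """Approximate OT plan between uniform marginals for a (B, N, M) cost."""
    B, N, M = cost.shape
    log_K = -cost / eps                                  # log of the Gibbs kernel
    log_a = torch.full((B, N), -math.log(N), device=cost.device)
    log_b = torch.full((B, M), -math.log(M), device=cost.device)
    log_u = torch.zeros_like(log_a)
    log_v = torch.zeros_like(log_b)
    for _ in range(iters):                               # Sinkhorn-Knopp updates
        log_u = log_a - torch.logsumexp(log_K + log_v.unsqueeze(1), dim=2)
        log_v = log_b - torch.logsumexp(log_K + log_u.unsqueeze(2), dim=1)
    return torch.exp(log_u.unsqueeze(2) + log_K + log_v.unsqueeze(1))


def spatial_resonance(f_t: torch.Tensor, f_t1: torch.Tensor) -> torch.Tensor:
    """Fuse two adjacent (B, C, H, W) feature maps into their 'common' part.

    Frame t+1 is transported into frame t's spatial layout via the OT plan
    (barycentric projection), then averaged with frame t.
    """
    B, C, H, W = f_t.shape
    x = f_t.flatten(2).transpose(1, 2)                   # (B, HW, C)
    y = f_t1.flatten(2).transpose(1, 2)                  # (B, HW, C)
    # Cosine cost between every spatial position of frame t and frame t+1.
    cost = 1.0 - F.cosine_similarity(x.unsqueeze(2), y.unsqueeze(1), dim=-1)
    plan = sinkhorn(cost)                                # (B, HW, HW) soft matching
    aligned = (H * W) * (plan @ y)                       # project t+1 onto t's grid
    common = 0.5 * (x + aligned)                         # shared spatial features
    return common.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    f_t, f_t1 = torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7)
    print(spatial_resonance(f_t, f_t1).shape)            # torch.Size([2, 64, 7, 7])
```

Under these assumptions, the pooled "common" features of each adjacent frame pair would feed the temporal decoder, while the frame-wise loss mentioned in the abstract would separately preserve each frame's own features against this shared component.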
format | Article |
id | doaj-art-51cd6b223e1146159a1020d6c393b8b1 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-51cd6b223e1146159a1020d6c393b8b1; 2025-01-10T00:01:29Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 5491-5506; DOI 10.1109/ACCESS.2025.3526330; article 10829616; Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement; Zhen Wang (https://orcid.org/0009-0005-6450-3699), Dongyuan Li (https://orcid.org/0000-0002-4462-3563), Renhe Jiang (https://orcid.org/0000-0003-2593-4638), Manabu Okumura (https://orcid.org/0009-0001-7730-1536); School of Information and Communication Engineering, Tokyo Institute of Technology, Tokyo, Japan (Wang, Okumura); Center for Spatial Information Science, The University of Tokyo, Tokyo, Japan (Li, Jiang); https://ieeexplore.ieee.org/document/10829616/; Sign language recognition; computer vision; video analysis; spatial-temporal datasets |
spellingShingle | Zhen Wang Dongyuan Li Renhe Jiang Manabu Okumura Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement IEEE Access Sign language recognition computer vision video analysis spatial-temporal datasets |
title | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_full | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_fullStr | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_full_unstemmed | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_short | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_sort | continuous sign language recognition with multi scale spatial temporal feature enhancement |
topic | Sign language recognition computer vision video analysis spatial-temporal datasets |
url | https://ieeexplore.ieee.org/document/10829616/ |
work_keys_str_mv | AT zhenwang continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement AT dongyuanli continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement AT renhejiang continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement AT manabuokumura continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement |