Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement

Continuous Sign Language Recognition (CSLR) seeks to interpret the gestures used by deaf and hard-of-hearing individuals and translate them into natural language, thereby enhancing communication and interaction. A successful CSLR method relies on the continuous tracking of the presenter’s gestures and facial movements.

Full description

Saved in:
Bibliographic Details
Main Authors: Zhen Wang, Dongyuan Li, Renhe Jiang, Manabu Okumura
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10829616/
_version_ 1841550761539928064
author Zhen Wang
Dongyuan Li
Renhe Jiang
Manabu Okumura
author_facet Zhen Wang
Dongyuan Li
Renhe Jiang
Manabu Okumura
author_sort Zhen Wang
collection DOAJ
description Continuous Sign Language Recognition (CSLR) seeks to interpret the gestures used by deaf and hard-of-hearing individuals and translate them into natural language, thereby enhancing communication and interaction. A successful CSLR method relies on the continuous tracking of the presenter’s gestures and facial movements. Existing CSLR methods struggle to fully leverage fine-grained continuous frame information and often overlook the importance of multi-scale feature integration during decoding. To solve these issues, in this paper we propose a spatial-temporal feature-enhanced network, called STNet, for the CSLR task. Firstly, for better exploration of continuous frame information, we propose a spatial resonance module based on the optimal transport algorithm, which extracts the global common spatial features of two adjacent frames along the frame sequence. Secondly, we design a frame-wise loss to preserve and enhance the specific features of each frame. Lastly, to emphasize multi-scale feature fusion, we design a multi-temporal perception module on the decoder side that allows each frame to attend to a wider range of other frames and enhances information interaction across different scales. Extensive experiments on three benchmark datasets, PHOENIX14, PHOENIX14-T, and CSL-Daily, demonstrate that STNet consistently outperforms state-of-the-art methods, with a notable improvement of 2.9% in CSLR, showcasing its effectiveness and generalizability. Our approach provides a robust foundation for real-world applications such as sign language education and communication tools, while ablation and case studies highlight the impact of each module, paving the way for future research in CSLR.
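The record describes the spatial resonance module only at a high level: optimal transport is used to extract common spatial features shared by two adjacent frames. As an illustrative sketch, not the authors' implementation, entropic optimal transport (Sinkhorn iterations) can soft-align the patch features of frame t with those of frame t+1 and read off shared features via a barycentric projection. All names here (`sinkhorn`, `common_spatial_features`) are hypothetical, and the cost and hyperparameters are assumptions:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=200):
    """Entropic optimal transport via Sinkhorn-Knopp.
    a, b: marginal weights; C: cost matrix (scaled to ~[0, 1]).
    Returns the transport plan P with row sums a and column sums b."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):              # alternate marginal scalings
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def common_spatial_features(F1, F2, eps=0.1):
    """Soft-align patch features of two adjacent frames via OT.
    F1: (n, d) patch features of frame t; F2: (m, d) of frame t+1.
    Returns, for each patch in frame t, a convex combination of
    frame-(t+1) features (barycentric projection of the plan)."""
    # cost: pairwise squared Euclidean distance between patch features
    C = ((F1[:, None, :] - F2[None, :, :]) ** 2).sum(-1)
    C = C / C.max()                       # scale costs for numerical stability
    n, m = F1.shape[0], F2.shape[0]
    a = np.full(n, 1.0 / n)               # uniform patch weights
    b = np.full(m, 1.0 / m)
    P = sinkhorn(a, b, C, eps)
    return (P / P.sum(1, keepdims=True)) @ F2
```

In practice a CSLR model would apply this across the whole frame sequence and feed the aligned "common" features, alongside the frame-specific ones the record's frame-wise loss preserves, into the temporal decoder.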
format Article
id doaj-art-51cd6b223e1146159a1020d6c393b8b1
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-51cd6b223e1146159a1020d6c393b8b1
2025-01-10T00:01:29Z
eng
IEEE
IEEE Access, ISSN 2169-3536
2025-01-01, Vol. 13, pp. 5491–5506
DOI: 10.1109/ACCESS.2025.3526330 (article 10829616)
Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement
Zhen Wang (https://orcid.org/0009-0005-6450-3699), School of Information and Communication Engineering, Tokyo Institute of Technology, Tokyo, Japan
Dongyuan Li (https://orcid.org/0000-0002-4462-3563), Center for Spatial Information Science, The University of Tokyo, Tokyo, Japan
Renhe Jiang (https://orcid.org/0000-0003-2593-4638), Center for Spatial Information Science, The University of Tokyo, Tokyo, Japan
Manabu Okumura (https://orcid.org/0009-0001-7730-1536), School of Information and Communication Engineering, Tokyo Institute of Technology, Tokyo, Japan
https://ieeexplore.ieee.org/document/10829616/
Sign language recognition; computer vision; video analysis; spatial-temporal datasets
spellingShingle Zhen Wang
Dongyuan Li
Renhe Jiang
Manabu Okumura
Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement
IEEE Access
Sign language recognition
computer vision
video analysis
spatial-temporal datasets
title Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement
title_full Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement
title_fullStr Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement
title_full_unstemmed Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement
title_short Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement
title_sort continuous sign language recognition with multi scale spatial temporal feature enhancement
topic Sign language recognition
computer vision
video analysis
spatial-temporal datasets
url https://ieeexplore.ieee.org/document/10829616/
work_keys_str_mv AT zhenwang continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement
AT dongyuanli continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement
AT renhejiang continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement
AT manabuokumura continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement