Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement
Continuous Sign Language Recognition (CSLR) seeks to interpret the gestures used by deaf and hard-of-hearing individuals and translate them into natural language, thereby enhancing communication and interaction. A successful CSLR method relies on continuously tracking the presenter's gestures and facial movements.
Main Authors: | Zhen Wang, Dongyuan Li, Renhe Jiang, Manabu Okumura |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Sign language recognition, computer vision, video analysis, spatial-temporal datasets |
Online Access: | https://ieeexplore.ieee.org/document/10829616/ |
_version_ | 1841550761539928064 |
---|---|
author | Zhen Wang Dongyuan Li Renhe Jiang Manabu Okumura |
author_facet | Zhen Wang Dongyuan Li Renhe Jiang Manabu Okumura |
author_sort | Zhen Wang |
collection | DOAJ |
description | Continuous Sign Language Recognition (CSLR) seeks to interpret the gestures used by deaf and hard-of-hearing individuals and translate them into natural language, thereby enhancing communication and interaction. A successful CSLR method relies on continuously tracking the presenter's gestures and facial movements. Existing CSLR methods struggle to fully leverage fine-grained continuous frame information and often overlook the importance of multi-scale feature integration during decoding. To address these issues, this paper proposes a spatial-temporal feature-enhanced network, called STNet, for the CSLR task. First, to better exploit continuous frame information, we propose a spatial resonance module based on the optimal transport algorithm, which extracts the global common spatial features of adjacent frames along the frame sequence. Second, we design a frame-wise loss to preserve and enhance the features specific to each frame. Finally, to emphasize multi-scale feature fusion, we design a multi-temporal perception module on the decoder side, which allows each frame to attend to a larger range of other frames and enhances information interaction across scales. Extensive experiments on three benchmark datasets, PHOENIX14, PHOENIX14-T, and CSL-Daily, demonstrate that STNet consistently outperforms state-of-the-art methods, with a notable improvement of 2.9% in CSLR performance, showcasing its effectiveness and generalizability. Our approach provides a robust foundation for real-world applications such as sign language education and communication tools, while ablation and case studies highlight the impact of each module, paving the way for future research in CSLR. |
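The abstract describes STNet's three components only at a high level, and no code accompanies this record. The sketch below is therefore one plausible reading of the spatial resonance idea, not the authors' implementation: entropy-regularized optimal transport (log-domain Sinkhorn iterations) softly matches the spatial positions of two adjacent frames and keeps the features they share. Every name, shape, and hyper-parameter here (`sinkhorn`, `spatial_resonance`, `eps`, `iters`) is an assumption made for illustration.

```python
# Hypothetical sketch, NOT the authors' code: entropy-regularized optimal
# transport (log-domain Sinkhorn) used to align the spatial features of two
# adjacent video frames -- one plausible reading of a "spatial resonance" step.
import math

import torch
import torch.nn.functional as F


def sinkhorn(cost: torch.Tensor, eps: float = 0.05, iters: int = 20) -> torch.Tensor:
    """Approximate OT plan between uniform marginals for a (B, N, M) cost."""
    B, N, M = cost.shape
    log_K = -cost / eps                                  # log of the Gibbs kernel
    log_a = torch.full((B, N), -math.log(N), device=cost.device)
    log_b = torch.full((B, M), -math.log(M), device=cost.device)
    log_u = torch.zeros_like(log_a)
    log_v = torch.zeros_like(log_b)
    for _ in range(iters):                               # Sinkhorn-Knopp updates
        log_u = log_a - torch.logsumexp(log_K + log_v.unsqueeze(1), dim=2)
        log_v = log_b - torch.logsumexp(log_K + log_u.unsqueeze(2), dim=1)
    return torch.exp(log_u.unsqueeze(2) + log_K + log_v.unsqueeze(1))


def spatial_resonance(f_t: torch.Tensor, f_t1: torch.Tensor) -> torch.Tensor:
    """Fuse two adjacent (B, C, H, W) feature maps into their 'common' part.

    Frame t+1 is transported into frame t's spatial layout via the OT plan
    (barycentric projection), then averaged with frame t.
    """
    B, C, H, W = f_t.shape
    x = f_t.flatten(2).transpose(1, 2)                   # (B, HW, C)
    y = f_t1.flatten(2).transpose(1, 2)                  # (B, HW, C)
    # Cosine cost between every spatial position of frame t and frame t+1.
    cost = 1.0 - F.cosine_similarity(x.unsqueeze(2), y.unsqueeze(1), dim=-1)
    plan = sinkhorn(cost)                                # (B, HW, HW) soft matching
    aligned = (H * W) * (plan @ y)                       # project t+1 onto t's grid
    common = 0.5 * (x + aligned)                         # shared spatial features
    return common.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    f_t, f_t1 = torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7)
    print(spatial_resonance(f_t, f_t1).shape)            # torch.Size([2, 64, 7, 7])
```

Under these assumptions, the pooled "common" features of each adjacent frame pair would feed the temporal decoder, while the frame-wise loss mentioned in the abstract would separately preserve each frame's own features against this shared component.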
format | Article |
id | doaj-art-51cd6b223e1146159a1020d6c393b8b1 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-51cd6b223e1146159a1020d6c393b8b1; 2025-01-10T00:01:29Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 5491-5506; DOI 10.1109/ACCESS.2025.3526330; article 10829616; Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement; Zhen Wang (https://orcid.org/0009-0005-6450-3699), Dongyuan Li (https://orcid.org/0000-0002-4462-3563), Renhe Jiang (https://orcid.org/0000-0003-2593-4638), Manabu Okumura (https://orcid.org/0009-0001-7730-1536); School of Information and Communication Engineering, Tokyo Institute of Technology, Tokyo, Japan (Wang, Okumura); Center for Spatial Information Science, The University of Tokyo, Tokyo, Japan (Li, Jiang); https://ieeexplore.ieee.org/document/10829616/; Sign language recognition; computer vision; video analysis; spatial-temporal datasets |
spellingShingle | Zhen Wang Dongyuan Li Renhe Jiang Manabu Okumura Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement IEEE Access Sign language recognition computer vision video analysis spatial-temporal datasets |
title | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_full | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_fullStr | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_full_unstemmed | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_short | Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement |
title_sort | continuous sign language recognition with multi scale spatial temporal feature enhancement |
topic | Sign language recognition computer vision video analysis spatial-temporal datasets |
url | https://ieeexplore.ieee.org/document/10829616/ |
work_keys_str_mv | AT zhenwang continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement AT dongyuanli continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement AT renhejiang continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement AT manabuokumura continuoussignlanguagerecognitionwithmultiscalespatialtemporalfeatureenhancement |