Head and Hands Tunneling Pipeline for Enhancing Sign Language Recognition

Bibliographic Details
Main Authors: Ganzorig Batnasan, Munkh-Erdene Otgonbold, Qurban Ali Memon, Timothy K. Shih, Munkhjargal Gochoo
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Sign language recognition, foundation model, ViT, WLASL2000
Online Access: https://ieeexplore.ieee.org/document/11087542/
_version_ 1850071451569225728
author Ganzorig Batnasan
Munkh-Erdene Otgonbold
Qurban Ali Memon
Timothy K. Shih
Munkhjargal Gochoo
author_facet Ganzorig Batnasan
Munkh-Erdene Otgonbold
Qurban Ali Memon
Timothy K. Shih
Munkhjargal Gochoo
author_sort Ganzorig Batnasan
collection DOAJ
description Sign Language Recognition (SLR) presents a significant challenge as a fine-grained, scene- and subject-invariant video classification task, primarily relying on hand gestures and facial expressions to convey meaning. Vision foundation models, such as Vision Transformers (ViTs), trained on general human action recognition datasets, often struggle to capture the nuanced features of signs. We highlight two main challenges: 1) the loss of critical spatial features in the head and hand regions due to video downscaling during preprocessing, and 2) the lack of sufficient domain-specific knowledge of sign gestures in ViTs. To tackle these, we propose a pipeline comprising our Head & Hands Tunneling (H&HT) preprocessor and a domain-specifically pre-trained 32-frame ViT classifier. The H&HT preprocessor, incorporating the MediaPipe pose predictor, maximizes the preservation of critical spatial details from the signer’s head and hands in raw sign language videos. When the ViT model is pre-trained on a domain-specific, large-scale SLR dataset, the two parts complement each other. As a result, the 32-frame H&HT pipeline achieves a Top-1 accuracy of 62.82% on the WLASL2000 benchmark, surpassing the 32-frame models and ranking second among the 64-frame models. We also provide benchmarking results on the ASL-Citizen dataset and two revised versions of the WLASL2000 dataset. All weights and code are available at this link. (An illustrative sketch of the head-and-hands cropping idea follows the record fields below.)
format Article
id doaj-art-04c791ade48c4176bea9d47fbb8a23d5
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-04c791ade48c4176bea9d47fbb8a23d5
2025-08-20T02:47:18Z | eng | IEEE | IEEE Access | ISSN 2169-3536 | 2025-01-01 | Vol. 13, pp. 127926-127940 | DOI 10.1109/ACCESS.2025.3591123 | IEEE document 11087542
Head and Hands Tunneling Pipeline for Enhancing Sign Language Recognition
Ganzorig Batnasan [0]; Munkh-Erdene Otgonbold [1]; Qurban Ali Memon [2], https://orcid.org/0000-0003-4129-3025; Timothy K. Shih [3], https://orcid.org/0000-0003-4154-4752; Munkhjargal Gochoo [4], https://orcid.org/0000-0002-6613-7435
Affiliations: [0, 1, 4] Department of Computer Science and Software Engineering, UAEU, Al Ain, United Arab Emirates; [2] Department of Electrical and Communication Engineering, UAEU, Al Ain, United Arab Emirates; [3] College of EECS, National Central University, Taoyuan, Taiwan
Abstract: as given in the description field above.
https://ieeexplore.ieee.org/document/11087542/
Keywords: Sign language recognition; foundation model; ViT; WLASL2000
spellingShingle Ganzorig Batnasan
Munkh-Erdene Otgonbold
Qurban Ali Memon
Timothy K. Shih
Munkhjargal Gochoo
Head and Hands Tunneling Pipeline for Enhancing Sign Language Recognition
IEEE Access
Sign language recognition
foundation model
ViT
WLASL2000
title Head and Hands Tunneling Pipeline for Enhancing Sign Language Recognition
title_full Head and Hands Tunneling Pipeline for Enhancing Sign Language Recognition
title_fullStr Head and Hands Tunneling Pipeline for Enhancing Sign Language Recognition
title_full_unstemmed Head and Hands Tunneling Pipeline for Enhancing Sign Language Recognition
title_short Head and Hands Tunneling Pipeline for Enhancing Sign Language Recognition
title_sort head and hands tunneling pipeline for enhancing sign language recognition
topic Sign language recognition
foundation model
ViT
WLASL2000
url https://ieeexplore.ieee.org/document/11087542/
work_keys_str_mv AT ganzorigbatnasan headandhandstunnelingpipelineforenhancingsignlanguagerecognition
AT munkherdeneotgonbold headandhandstunnelingpipelineforenhancingsignlanguagerecognition
AT qurbanalimemon headandhandstunnelingpipelineforenhancingsignlanguagerecognition
AT timothykshih headandhandstunnelingpipelineforenhancingsignlanguagerecognition
AT munkhjargalgochoo headandhandstunnelingpipelineforenhancingsignlanguagerecognition
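
Illustrative sketch. The abstract above names two concrete components: a MediaPipe-based preprocessor that preserves spatial detail in the head and hand regions, and a 32-frame ViT classifier. The following is a minimal Python sketch of the cropping idea only, assuming the MediaPipe pose predictor; it is not the authors' H&HT implementation. The per-region resolution (PATCH), the crop-radius heuristic, the use of the nose and wrist landmarks as region centers, the side-by-side tiling, and the uniform 32-frame sampling are all illustrative assumptions, since the record does not specify them.

import cv2
import mediapipe as mp
import numpy as np

mp_pose = mp.solutions.pose
PATCH = 112  # assumed per-region resolution; the paper's value may differ

def crop_region(frame, cx, cy, half):
    # Square crop centered at the normalized landmark (cx, cy), clamped so
    # the window never leaves the frame, then resized to PATCH x PATCH.
    h, w = frame.shape[:2]
    x = int(min(max(cx, 0.0), 1.0) * w)
    y = int(min(max(cy, 0.0), 1.0) * h)
    x0, y0 = max(x - half, 0), max(y - half, 0)
    x1, y1 = min(x + half, w), min(y + half, h)
    return cv2.resize(frame[y0:y1, x0:x1], (PATCH, PATCH))

def head_hands_frame(frame, pose):
    # Run the MediaPipe pose predictor (expects RGB input) and crop the
    # head and both hands. Hand centers are approximated by the wrist
    # landmarks here, an assumption; the paper may use finer hand cues.
    res = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if res.pose_landmarks is None:
        return None  # no signer detected in this frame
    lm = res.pose_landmarks.landmark
    P = mp_pose.PoseLandmark
    half = max(frame.shape[:2]) // 8  # assumed crop-radius heuristic
    head = crop_region(frame, lm[P.NOSE].x, lm[P.NOSE].y, half)
    lhand = crop_region(frame, lm[P.LEFT_WRIST].x, lm[P.LEFT_WRIST].y, half)
    rhand = crop_region(frame, lm[P.RIGHT_WRIST].x, lm[P.RIGHT_WRIST].y, half)
    # Tile the three regions side by side so the preserved detail survives
    # any later downscaling; the tiling layout is an illustrative choice.
    return np.hstack([head, lhand, rhand])

def preprocess_video(path, num_frames=32):
    # Uniformly sample num_frames frames and apply the head/hands cropping.
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = set(np.linspace(0, max(total - 1, 0), num_frames, dtype=int))
    out = []
    with mp_pose.Pose(static_image_mode=False) as pose:
        for i in range(total):
            ok, frame = cap.read()
            if not ok:
                break
            if i in idxs:
                tiled = head_hands_frame(frame, pose)
                if tiled is not None:
                    out.append(tiled)
    cap.release()
    return np.stack(out) if out else None

In the published pipeline, the preprocessor's output frames would be fed to the domain-specifically pre-trained 32-frame ViT classifier; the tiling above is only one plausible way to hand the preserved head and hand regions to a downstream model.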