Cross-attention multi branch for Vietnamese sign language recognition: CrossViViT
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer 2025-07-01 |
| Series: | Discover Computing |
| Subjects: | |
| Online Access: | https://doi.org/10.1007/s10791-025-09669-0 |
| Summary: | Sign language serves as the primary communication medium for individuals who are deaf or hard of hearing. Despite its critical importance, barriers persist in communication between the deaf community and the broader society, primarily due to limited sign language proficiency among the general population. While automated sign language recognition (ASLR) systems leveraging machine learning technologies offer a promising solution, existing approaches face challenges in optimizing the trade-off between computational efficiency and recognition accuracy. This study presents CrossViViT, a novel architecture that integrates cross-attention mechanisms with video vision Transformer networks to address these limitations. Drawing inspiration from multi-branch network architectures that combine diverse feature perspectives for flexible image recognition, our approach achieves both computational efficiency and high accuracy. The proposed model demonstrates exceptional performance on the Vietnamese Sign Language (VSL) dataset, achieving 92.47% accuracy in recognizing 50 distinct gestures across 8510 videos while maintaining computational efficiency at approximately 629 FLOPS. |
| ISSN: | 2948-2992 |