Cross-attention multi-branch for Vietnamese sign language recognition: CrossViViT


Bibliographic Details
Main Authors: Minh Hoang Chu, Hoang Diep Nguyen, Thi Ngoc Anh Nguyen, Hoai Nam Vu
Format: Article
Language: English
Published: Springer 2025-07-01
Series: Discover Computing
Subjects:
Online Access: https://doi.org/10.1007/s10791-025-09669-0
Description
Summary: Abstract — Sign language serves as the primary communication medium for individuals who are deaf or hard of hearing. Despite its critical importance, barriers persist in communication between the deaf community and the broader society, primarily due to limited sign language proficiency among the general population. While automated sign language recognition (ASLR) systems leveraging machine learning technologies offer a promising solution, existing approaches face challenges in optimizing the trade-off between computational efficiency and recognition accuracy. This study presents CrossViViT, a novel architecture that integrates cross-attention mechanisms with Video Vision Transformer (ViViT) networks to address these limitations. Drawing inspiration from multi-branch network architectures that combine diverse feature perspectives for flexible image recognition, our approach achieves both computational efficiency and high accuracy. The proposed model demonstrates strong performance on the Vietnamese Sign Language (VSL) dataset, achieving 92.47% accuracy in recognizing 50 distinct gestures across 8510 videos while maintaining computational efficiency at approximately 629 FLOPs.
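The abstract describes cross-attention used to fuse feature branches of a Video Vision Transformer, but this record gives no implementation details. The following is only an illustrative sketch of a single scaled dot-product cross-attention step between two hypothetical token branches (e.g. a spatial and a temporal stream); all names, dimensions, and weights here are assumptions for illustration, not the paper's actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, w_q, w_k, w_v):
    """Tokens from one branch (queries) attend to tokens from
    the other branch (keys/values), producing fused features."""
    q = q_tokens @ w_q                          # (n_q, d)
    k = kv_tokens @ w_k                         # (n_kv, d)
    v = kv_tokens @ w_v                         # (n_kv, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_q, n_kv)
    return softmax(scores, axis=-1) @ v         # (n_q, d)

rng = np.random.default_rng(0)
d = 64
spatial = rng.standard_normal((16, d))    # hypothetical spatial-branch tokens
temporal = rng.standard_normal((8, d))    # hypothetical temporal-branch tokens
# Random projections stand in for learned weight matrices.
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused = cross_attention(spatial, temporal, w_q, w_k, w_v)
print(fused.shape)  # (16, 64)
```

In a multi-branch setup this step would typically be applied in both directions (each branch querying the other) and the fused tokens passed to subsequent Transformer layers; how CrossViViT arranges this is described in the full article.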
ISSN: 2948-2992