Beyond Granularity: Enhancing Continuous Sign Language Recognition with Granularity-Aware Feature Fusion and Attention Optimization

Bibliographic Details
Main Authors: Yao Du, Taiying Peng, Xiaohui Hu
Affiliations: School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China (Yao Du, Taiying Peng); Science and Technology on Integrated Information System Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China (Xiaohui Hu)
Format: Article
Language: English
Published: MDPI AG, 2024-10-01
Series: Applied Sciences, Vol. 14, No. 19 (2024), Article 8937
ISSN: 2076-3417
DOI: 10.3390/app14198937
Subjects: continuous sign language recognition; multi-scaled feature fusion; self-attention; Transformer
Online Access: https://www.mdpi.com/2076-3417/14/19/8937

Description: The advancement of deep learning techniques has significantly propelled continuous sign language recognition (cSLR). However, spatial feature extraction from sign language videos in RGB space tends to focus on overall image information while neglecting traits at different granularities, such as eye gaze and lip shape, which are fine-grained, or posture and gestures, which are more macroscopic. Efficiently fusing visual information of different granularities is therefore crucial for accurate sign language recognition. In addition, a vanilla Transformer applied to sequence modeling in cSLR performs poorly because certain video frames can interfere with the attention mechanism. These limitations constrain the model's ability to capture the underlying semantics. We introduce a feature fusion method that integrates visual features of disparate granularities and refine the attention metric to enhance the Transformer's comprehension of video content. Specifically, we extract CNN feature maps with varying receptive fields and employ a self-attention mechanism to fuse feature maps of different granularities, thereby obtaining multi-scale spatial features of the sign language frames. For video modeling, we first analyze why the vanilla Transformer fails in cSLR and observe that the magnitudes of frame feature vectors can distort the distribution of attention weights. We therefore measure attention weights by the Euclidean distance between vectors instead of the scaled dot-product, enhancing dynamic temporal modeling. Finally, we integrate the two components into the model MSF-ET (Multi-Scaled feature Fusion–Euclidean Transformer) for cSLR and train it end-to-end. Experiments on large-scale cSLR benchmarks, PHOENIX-2014 and Chinese Sign Language (CSL), validate its effectiveness.
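
The two mechanisms described above can be illustrated in code. Below is a minimal PyTorch sketch (not the authors' implementation) of distance-based attention, assuming the negated, scaled pairwise squared Euclidean distance simply replaces the scaled dot-product logits; the function name euclidean_attention and the 1/sqrt(D) scaling are assumptions.

import torch
import torch.nn.functional as F

def euclidean_attention(q, k, v):
    # Hypothetical distance-based attention; q, k, v: (B, L, D).
    d = q.size(-1)
    # Pairwise squared Euclidean distances between queries and keys.
    dist2 = torch.cdist(q, k, p=2).pow(2)          # (B, Lq, Lk)
    # Smaller distance -> larger weight, so negate before the softmax;
    # dividing by sqrt(D) mirrors the scaling of standard attention.
    attn = F.softmax(-dist2 / (d ** 0.5), dim=-1)
    return attn @ v                                 # (B, Lq, D)

frames = torch.randn(2, 8, 64)                      # 2 clips, 8 frames, 64-dim
out = euclidean_attention(frames, frames, frames)   # (2, 8, 64)

The granularity-aware fusion can be sketched in the same spirit: per-frame feature maps from CNN stages with different receptive fields are pooled into tokens and fused by self-attention. The module name MultiScaleFusion, the stage widths, and the mean-pooling choice are likewise assumptions rather than the paper's specification.

import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    # Hypothetical fusion of per-frame CNN maps at several scales.
    def __init__(self, dims=(256, 512, 1024), d_model=512, heads=8):
        super().__init__()
        # Project each scale's pooled feature to a common width.
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, feats):
        # feats: one map per scale, each of shape (B, C_i, H_i, W_i).
        tokens = torch.stack(
            [p(f.mean(dim=(-2, -1))) for p, f in zip(self.proj, feats)],
            dim=1,
        )                                   # (B, num_scales, d_model)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)            # one multi-scale frame feature

In the pipeline the abstract describes, such fused frame features would feed the Euclidean-attention Transformer for temporal modeling, with the whole model trained end-to-end.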