GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba

Bibliographic Details
Main Authors: Yingnan Zhao, Zheng Hu, Fangqi Ding, Jielin Jiang, Xiaolong Xu
Format: Article
Language: English
Published: Springer, 2025-06-01
Series: Complex & Intelligent Systems
Subjects: Computer vision; Globally Deformable VMamba; Attention mechanism; Scene text detection
Online Access: https://doi.org/10.1007/s40747-025-01987-6
_version_ 1849235405563494400
author Yingnan Zhao
Zheng Hu
Fangqi Ding
Jielin Jiang
Xiaolong Xu
author_facet Yingnan Zhao
Zheng Hu
Fangqi Ding
Jielin Jiang
Xiaolong Xu
author_sort Yingnan Zhao
collection DOAJ
description Abstract Detecting arbitrary-shaped text in natural scenes remains a significant challenge in deep learning research. Contemporary text detectors based on Convolutional Neural Networks struggle to model long-range dependencies effectively. While Vision Transformers theoretically enable global context modeling via self-attention, their practical deployment is constrained by the quadratic computational complexity of attention in real-world scenarios. To address these challenges, this study proposes a novel scene text detector, GDText-VM (Globally Deformable Text-VMamba), built on a deformable VMamba framework. The detector combines a global channel-spatial attention mechanism with Fourier contour modeling, strengthening long-range dependency capture and achieving a global receptive field and rapid convergence while maintaining linear computational complexity. Unlike the original VMamba, GDText-VM integrates deformable convolutions to sharpen focus on local regions and reduce reliance on cross-shaped activation patterns. Additionally, to improve how well GDText-VM fits text contours in the Fourier domain, the study introduces a Global Attention Shuffle Module (GASM) that fuses global channel and spatial features, mitigating the impact of feature imbalance on contour fitting and significantly improving detection accuracy. Comprehensive experiments on Total-Text, CTW1500, and ICDAR2015 compare GDText-VM with classical scene text detection approaches. The results indicate that GDText-VM outperforms state-of-the-art methods in precision, recall, and F-measure while remaining computationally efficient at 25.88M parameters and 40.83G FLOPs. Notably, GDText-VM achieves F-measures of 88.5% on Total-Text, 88.9% on CTW1500, and 88.6% on ICDAR2015.
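The Fourier contour modeling the abstract refers to can be illustrated with a minimal sketch. This is a generic illustration of the idea (not the paper's actual implementation): a closed text contour, sampled as complex points, is compressed to a small number of low-frequency Fourier coefficients and then reconstructed, so a network only needs to regress a compact coefficient vector instead of dense boundary points. The function names and the choice of `k` below are illustrative assumptions.

```python
import numpy as np

def fourier_contour(points, k=5):
    """Compress a closed contour to 2k+1 complex Fourier coefficients.
    points: (N, 2) array of (x, y) samples taken along the contour."""
    z = points[:, 0] + 1j * points[:, 1]       # treat (x, y) as the complex plane
    coeffs = np.fft.fft(z) / len(z)            # normalized DFT of the contour
    # keep the DC term plus the k lowest positive and negative frequencies
    idx = np.r_[0:k + 1, len(z) - k:len(z)]
    return coeffs[idx]

def reconstruct(kept, k, n_points=100):
    """Rebuild an approximate closed contour from truncated coefficients."""
    t = np.arange(n_points) / n_points
    freqs = np.r_[0:k + 1, -k:0]               # frequency indices matching `kept`
    z = sum(c * np.exp(2j * np.pi * f * t) for c, f in zip(kept, freqs))
    return np.stack([z.real, z.imag], axis=1)

# A unit circle is carried entirely by a single frequency, so the
# truncated reconstruction should recover it almost exactly.
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
kept = fourier_contour(circle, k=3)
approx = reconstruct(kept, k=3)
print(np.abs(np.linalg.norm(approx, axis=1) - 1).max())  # ~1e-15 for a circle
```

Truncating to low frequencies is what makes the representation robust to jagged, arbitrary-shaped boundaries: high-frequency noise in the sampled contour is simply discarded, while smooth curved text outlines survive with few coefficients.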
format Article
id doaj-art-cbd7a84cb82e4b9dbae3aa2f2dd17901
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-06-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-cbd7a84cb82e4b9dbae3aa2f2dd17901 (indexed 2025-08-20T04:02:49Z)
GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
Complex & Intelligent Systems (Springer), ISSN 2199-4536 / 2198-6053, 2025-06-01, Vol. 11, Iss. 8, pp. 1-19
DOI: https://doi.org/10.1007/s40747-025-01987-6
Authors: Yingnan Zhao, Zheng Hu, Fangqi Ding (School of Computer and Science, Nanjing University of Information Science and Technology); Jielin Jiang, Xiaolong Xu (School of Software, Nanjing University of Information Science and Technology)
Subjects: Computer vision; Globally Deformable VMamba; Attention mechanism; Scene text detection
spellingShingle Yingnan Zhao
Zheng Hu
Fangqi Ding
Jielin Jiang
Xiaolong Xu
GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
Complex & Intelligent Systems
Computer vision
Globally Deformable VMamba
Attention mechanism
Scene text detection
title GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_full GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_fullStr GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_full_unstemmed GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_short GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_sort gdtext vm an arbitrary shaped scene text detector based on globally deformable vmamba
topic Computer vision
Globally Deformable VMamba
Attention mechanism
Scene text detection
url https://doi.org/10.1007/s40747-025-01987-6
work_keys_str_mv AT yingnanzhao gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba
AT zhenghu gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba
AT fangqiding gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba
AT jielinjiang gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba
AT xiaolongxu gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba