GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba

Bibliographic Details
Main Authors: Yingnan Zhao, Zheng Hu, Fangqi Ding, Jielin Jiang, Xiaolong Xu
Format: Article
Language: English
Published: Springer, 2025-06-01
Series: Complex & Intelligent Systems
Subjects: Computer vision; Globally Deformable VMamba; Attention mechanism; Scene text detection
Online Access: https://doi.org/10.1007/s40747-025-01987-6
_version_ 1849235405563494400
author Yingnan Zhao
Zheng Hu
Fangqi Ding
Jielin Jiang
Xiaolong Xu
author_facet Yingnan Zhao
Zheng Hu
Fangqi Ding
Jielin Jiang
Xiaolong Xu
author_sort Yingnan Zhao
collection DOAJ
description Abstract Detecting arbitrary-shaped text in natural scenes remains a significant challenge in deep learning research. Contemporary text detectors based on Convolutional Neural Networks struggle to model long-range dependencies effectively. While Vision Transformers theoretically enable global context modeling via self-attention, their practical deployment is constrained by the quadratic computational complexity of attention in real-world scenarios. To address these challenges, this study proposes a novel scene text detector, GDText-VM (Globally Deformable Text-VMamba), built on a deformable VMamba framework. The detector combines a global channel-spatial attention mechanism with Fourier contour modeling, strengthening long-range dependency capture and achieving a global receptive field and rapid convergence while maintaining linear computational complexity. Unlike the original VMamba, GDText-VM integrates deformable convolutions to sharpen focus on local regions and reduce reliance on cross-shaped activation patterns. Additionally, to improve how well GDText-VM fits text contours in the Fourier domain, the study introduces a Global Attention Shuffle Module (GASM) that fuses global channel and spatial features, mitigating the impact of feature imbalance on contour fitting and significantly improving detection accuracy. Comprehensive experiments on Total-Text, CTW1500, and ICDAR2015 compare GDText-VM with classical scene text detection approaches. The results indicate that GDText-VM outperforms state-of-the-art methods in precision, recall, and F-measure while remaining computationally efficient at 25.88M parameters and 40.83G FLOPs. Notably, GDText-VM achieves F-measures of 88.5% on Total-Text, 88.9% on CTW1500, and 88.6% on ICDAR2015.
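The Fourier contour modeling the abstract refers to can be illustrated with a minimal sketch. This is a generic illustration of the idea (not the paper's actual implementation): a closed text contour, sampled as complex points, is compressed to a small number of low-frequency Fourier coefficients and then reconstructed, so a network only needs to regress a compact coefficient vector instead of dense boundary points. The function names and the choice of `k` below are illustrative assumptions.

```python
import numpy as np

def fourier_contour(points, k=5):
    """Compress a closed contour to 2k+1 complex Fourier coefficients.
    points: (N, 2) array of (x, y) samples taken along the contour."""
    z = points[:, 0] + 1j * points[:, 1]       # treat (x, y) as the complex plane
    coeffs = np.fft.fft(z) / len(z)            # normalized DFT of the contour
    # keep the DC term plus the k lowest positive and negative frequencies
    idx = np.r_[0:k + 1, len(z) - k:len(z)]
    return coeffs[idx]

def reconstruct(kept, k, n_points=100):
    """Rebuild an approximate closed contour from truncated coefficients."""
    t = np.arange(n_points) / n_points
    freqs = np.r_[0:k + 1, -k:0]               # frequency indices matching `kept`
    z = sum(c * np.exp(2j * np.pi * f * t) for c, f in zip(kept, freqs))
    return np.stack([z.real, z.imag], axis=1)

# A unit circle is carried entirely by a single frequency, so the
# truncated reconstruction should recover it almost exactly.
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
kept = fourier_contour(circle, k=3)
approx = reconstruct(kept, k=3)
print(np.abs(np.linalg.norm(approx, axis=1) - 1).max())  # ~1e-15 for a circle
```

Truncating to low frequencies is what makes the representation robust to jagged, arbitrary-shaped boundaries: high-frequency noise in the sampled contour is simply discarded, while smooth curved text outlines survive with few coefficients.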
format Article
id doaj-art-cbd7a84cb82e4b9dbae3aa2f2dd17901
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-06-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-cbd7a84cb82e4b9dbae3aa2f2dd17901 (indexed 2025-08-20T04:02:49Z)
GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
Complex & Intelligent Systems (Springer), ISSN 2199-4536 / 2198-6053, 2025-06-01, Vol. 11, Iss. 8, pp. 1-19
DOI: https://doi.org/10.1007/s40747-025-01987-6
Authors: Yingnan Zhao, Zheng Hu, Fangqi Ding (School of Computer and Science, Nanjing University of Information Science and Technology); Jielin Jiang, Xiaolong Xu (School of Software, Nanjing University of Information Science and Technology)
Subjects: Computer vision; Globally Deformable VMamba; Attention mechanism; Scene text detection
spellingShingle Yingnan Zhao
Zheng Hu
Fangqi Ding
Jielin Jiang
Xiaolong Xu
GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
Complex & Intelligent Systems
Computer vision
Globally Deformable VMamba
Attention mechanism
Scene text detection
title GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_full GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_fullStr GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_full_unstemmed GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_short GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba
title_sort gdtext vm an arbitrary shaped scene text detector based on globally deformable vmamba
topic Computer vision
Globally Deformable VMamba
Attention mechanism
Scene text detection
url https://doi.org/10.1007/s40747-025-01987-6
work_keys_str_mv AT yingnanzhao gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba
AT zhenghu gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba
AT fangqiding gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba
AT jielinjiang gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba
AT xiaolongxu gdtextvmanarbitraryshapedscenetextdetectorbasedongloballydeformablevmamba