Apvit: ViT with adaptive patches for scene text recognition

Abstract Scene texts in nature exhibit varied colors, which serve as a significant distinguishing feature that effectively suppresses background interference. In this study, color clustering is utilized as a prior guide to group patches, enhancing their spatial relationships. Additionally, patch siz...

Bibliographic Details
Main Authors: Ning Zhang, Ce Li, Zongshun Wang, Jialin Ma, Zhiqiang Feng
Format: Article
Language: English
Published: Springer 2025-03-01
Series: Discover Applied Sciences
Subjects: Adaptive patches; ViTs; Scene text recognition; Prune
Online Access: https://doi.org/10.1007/s42452-025-06570-9
author Ning Zhang
Ce Li
Zongshun Wang
Jialin Ma
Zhiqiang Feng
collection DOAJ
description Abstract Scene texts in nature exhibit varied colors, which serve as a significant distinguishing feature that effectively suppresses background interference. In this study, color clustering is utilized as a prior guide to group patches, enhancing their spatial relationships. Additionally, patch sizes are adaptively adjusted during training to balance speed and accuracy, while unimportant tokens and blocks in the model are pruned. We propose APViT, which modifies the ViTs model for scene text recognition requirements. It consists of three components: Sparse Patches Selection (SPS), ViT-STR, and Token Code (TC). First, SPS segments images into appropriate patches and clusters similar ones to explore diverse local patches adaptively. Second, we enhance the ViTs model specifically for scene text recognition as ViT-STR. Finally, TC prunes non-essential parts of the network based on self-attention mechanisms to accelerate performance. Consequently, our proposed APViT model outperforms state-of-the-art methods across several datasets, demonstrating its effectiveness.
format Article
id doaj-art-eae47f23328d4cdca400675f9bc2fd59
institution Kabale University
issn 3004-9261
language English
publishDate 2025-03-01
publisher Springer
record_format Article
series Discover Applied Sciences
spelling doaj-art-eae47f23328d4cdca400675f9bc2fd59
2025-08-20T03:41:40Z
eng
Springer
Discover Applied Sciences
3004-9261
2025-03-01
74114 (apparently volume, issue, and page numbers, run together in the source record)
10.1007/s42452-025-06570-9
Apvit: ViT with adaptive patches for scene text recognition
Ning Zhang, Ce Li, Zongshun Wang, Jialin Ma, Zhiqiang Feng
College of Electrical and Information Engineering, Lanzhou University of Technology (affiliation given for all five authors)
https://doi.org/10.1007/s42452-025-06570-9
Adaptive patches
ViTs
Scene text recognition
Prune
title Apvit: ViT with adaptive patches for scene text recognition
topic Adaptive patches
ViTs
Scene text recognition
Prune
url https://doi.org/10.1007/s42452-025-06570-9
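Illustrative note on the abstract: the record states that the Token Code (TC) component prunes non-essential parts of the network based on self-attention, but gives no further detail. The sketch below is not the authors' method. It only shows one common way such pruning can work: score each patch token by the attention weight it receives from a single query vector, then keep the top-scoring fraction. The function names (`cls_attention`, `prune_tokens`), the `keep_ratio` parameter, and the toy data are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention(cls_query, patch_keys):
    """Scaled dot-product attention weights from one query to every patch key."""
    scale = math.sqrt(len(cls_query))
    scores = [
        sum(q * k for q, k in zip(cls_query, key)) / scale
        for key in patch_keys
    ]
    return softmax(scores)


def prune_tokens(tokens, weights, keep_ratio=0.5):
    """Keep the highest-weighted tokens, preserving their original order.

    keep_ratio is a hypothetical hyperparameter, not taken from the paper.
    """
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept], kept


if __name__ == "__main__":
    # Toy data: four 2-dimensional patch embeddings and one query vector.
    patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, -1.0]]
    query = [1.0, 0.0]
    weights = cls_attention(query, patches)
    kept_tokens, kept_idx = prune_tokens(patches, weights, keep_ratio=0.5)
    print(kept_idx)  # prints [0, 2]
```

With the toy data, patches 0 and 2 align with the query and receive the largest weights, so they survive a 50% keep ratio. A real model would typically average such weights over attention heads and apply the pruning between transformer blocks; those details are not specified in this record.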