APViT: ViT with adaptive patches for scene text recognition


Bibliographic Details
Main Authors: Ning Zhang, Ce Li, Zongshun Wang, Jialin Ma, Zhiqiang Feng
Format: Article
Language:English
Published: Springer 2025-03-01
Series:Discover Applied Sciences
Online Access:https://doi.org/10.1007/s42452-025-06570-9
Description
Summary: Abstract — Scene text in the wild exhibits varied colors, a distinguishing feature that can effectively suppress background interference. In this study, color clustering is used as a prior to group patches, strengthening their spatial relationships. Additionally, patch sizes are adjusted adaptively during training to balance speed and accuracy, while unimportant tokens and blocks in the model are pruned. We propose APViT, which adapts the ViT architecture to the requirements of scene text recognition. It consists of three components: Sparse Patches Selection (SPS), ViT-STR, and Token Code (TC). First, SPS segments images into appropriately sized patches and clusters similar ones to adaptively explore diverse local patches. Second, we tailor the ViT model to scene text recognition as ViT-STR. Finally, TC prunes non-essential parts of the network based on self-attention scores to accelerate inference. Our proposed APViT model outperforms state-of-the-art methods across several benchmark datasets, demonstrating its effectiveness.
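Two of the ideas in the abstract — grouping patches by a color-clustering prior, and pruning the least-attended tokens — can be illustrated with a minimal NumPy sketch. All function names below are hypothetical, and the 2-means clustering and mean-attention scoring are simplified stand-ins for illustration, not the paper's actual SPS or TC modules:

```python
import numpy as np

def extract_patches(img, p):
    """Split an H x W x 3 image into non-overlapping p x p patches."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p, p, C)

def cluster_patches_by_color(patches, iters=10):
    """Group patches into two clusters by mean RGB via a tiny 2-means loop
    (a toy stand-in for the color-clustering prior in the abstract)."""
    feats = patches.reshape(len(patches), -1, patches.shape[-1]).mean(axis=1)
    # Deterministic init: first patch, plus the patch farthest from it.
    centers = np.stack([feats[0], feats[((feats - feats[0]) ** 2).sum(-1).argmax()]])
    for _ in range(iters):
        labels = ((feats[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    return labels

def prune_tokens_by_attention(tokens, attn, keep_ratio=0.5):
    """Keep only the top keep_ratio fraction of tokens by mean incoming
    attention (one plausible reading of attention-based pruning)."""
    scores = attn.mean(axis=0)               # mean attention each token receives
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order kept
    return tokens[keep], keep
```

For example, on an image whose left half is red and right half is blue, the red and blue patches land in separate clusters, and tokens whose attention columns score lowest are dropped first. The real model would operate on learned token embeddings and multi-head attention maps rather than raw RGB means.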
ISSN:3004-9261