Apvit: ViT with adaptive patches for scene text recognition
Abstract: Scene texts in nature exhibit varied colors, which serve as a significant distinguishing feature that effectively suppresses background interference. In this study, color clustering is utilized as a prior guide to group patches, enhancing their spatial relationships. Additionally, patch siz...
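The idea of using color clustering as a prior to group patches can be illustrated with a small sketch. This is not the authors' SPS implementation: it assumes, purely for illustration, that each patch is summarized by its mean RGB color and that patches are grouped with a plain k-means loop. The names `mean_color`, `sq_dist`, and `cluster_patches` are hypothetical.

```python
from typing import List, Sequence, Tuple

Color = Tuple[float, ...]  # one RGB value, e.g. (255.0, 0.0, 0.0)


def mean_color(patch: Sequence[Color]) -> Color:
    """Average the RGB values of all pixels in one patch."""
    n = len(patch)
    return tuple(sum(px[c] for px in patch) / n for c in range(3))


def sq_dist(a: Color, b: Color) -> float:
    """Squared Euclidean distance between two colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_patches(patches: Sequence[Sequence[Color]], k: int, iters: int = 10) -> List[int]:
    """Assign each patch a cluster label based on its mean color (k-means).

    Assumption: mean color is the only feature; a real system would likely
    use richer features and a learned grouping.
    """
    feats = [mean_color(p) for p in patches]
    # Deterministic initialization: pick k features spread evenly over the list.
    centers = [feats[i * len(feats) // k] for i in range(k)]
    labels = [0] * len(feats)
    for _ in range(iters):
        # Assignment step: nearest center by squared color distance.
        labels = [min(range(k), key=lambda j: sq_dist(f, centers[j])) for f in feats]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(feats, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(m[c] for m in members) / len(members) for c in range(3))
    return labels


# Two reddish patches and two bluish patches, each with four pixels.
patches = [
    [(250, 10, 10)] * 4,
    [(240, 20, 20)] * 4,
    [(10, 10, 250)] * 4,
    [(20, 20, 240)] * 4,
]
labels = cluster_patches(patches, k=2)
```

In this toy input the two reddish patches end up in one group and the two bluish patches in the other, which is the kind of color-driven grouping the abstract refers to.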
| Main Authors: | Ning Zhang, Ce Li, Zongshun Wang, Jialin Ma, Zhiqiang Feng |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-03-01 |
| Series: | Discover Applied Sciences |
| Subjects: | Adaptive patches; ViTs; Scene text recognition; Prune |
| Online Access: | https://doi.org/10.1007/s42452-025-06570-9 |
| _version_ | 1849390391900504064 |
|---|---|
| author | Ning Zhang, Ce Li, Zongshun Wang, Jialin Ma, Zhiqiang Feng |
| author_facet | Ning Zhang, Ce Li, Zongshun Wang, Jialin Ma, Zhiqiang Feng |
| author_sort | Ning Zhang |
| collection | DOAJ |
| description | Abstract: Scene texts in nature exhibit varied colors, which serve as a significant distinguishing feature that effectively suppresses background interference. In this study, color clustering is utilized as a prior guide to group patches, enhancing their spatial relationships. Additionally, patch sizes are adaptively adjusted during training to balance speed and accuracy, while unimportant tokens and blocks in the model are pruned. We propose APViT, which modifies the ViTs model for scene text recognition requirements. It consists of three components: Sparse Patches Selection (SPS), ViT-STR, and Token Code (TC). First, SPS segments images into appropriate patches and clusters similar ones to explore diverse local patches adaptively. Second, we enhance the ViTs model specifically for scene text recognition as ViT-STR. Finally, TC prunes non-essential parts of the network based on self-attention mechanisms to accelerate performance. Consequently, our proposed APViT model outperforms state-of-the-art methods across several datasets, demonstrating its effectiveness. |
| format | Article |
| id | doaj-art-eae47f23328d4cdca400675f9bc2fd59 |
| institution | Kabale University |
| issn | 3004-9261 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | Springer |
| record_format | Article |
| series | Discover Applied Sciences |
| spelling | doaj-art-eae47f23328d4cdca400675f9bc2fd59; 2025-08-20T03:41:40Z; eng; Springer; Discover Applied Sciences; 3004-9261; 2025-03-01; 74114; 10.1007/s42452-025-06570-9; Apvit: ViT with adaptive patches for scene text recognition; Ning Zhang [0], Ce Li [1], Zongshun Wang [2], Jialin Ma [3], Zhiqiang Feng [4]; [0]-[4] College of Electrical and Information Engineering, Lanzhou University of Technology; Abstract: Scene texts in nature exhibit varied colors, which serve as a significant distinguishing feature that effectively suppresses background interference. In this study, color clustering is utilized as a prior guide to group patches, enhancing their spatial relationships. Additionally, patch sizes are adaptively adjusted during training to balance speed and accuracy, while unimportant tokens and blocks in the model are pruned. We propose APViT, which modifies the ViTs model for scene text recognition requirements. It consists of three components: Sparse Patches Selection (SPS), ViT-STR, and Token Code (TC). First, SPS segments images into appropriate patches and clusters similar ones to explore diverse local patches adaptively. Second, we enhance the ViTs model specifically for scene text recognition as ViT-STR. Finally, TC prunes non-essential parts of the network based on self-attention mechanisms to accelerate performance. Consequently, our proposed APViT model outperforms state-of-the-art methods across several datasets, demonstrating its effectiveness.; https://doi.org/10.1007/s42452-025-06570-9; Adaptive patches; ViTs; Scene text recognition; Prune |
| title | Apvit: ViT with adaptive patches for scene text recognition |
| topic | Adaptive patches; ViTs; Scene text recognition; Prune |
| url | https://doi.org/10.1007/s42452-025-06570-9 |
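
The description field says that TC prunes non-essential parts of the network based on self-attention. The record does not state how tokens are scored, so the following is only a minimal sketch of one common form of attention-based token pruning: it assumes each token already has a scalar importance score (for example, the attention it receives from a class token) and keeps the highest-scoring fraction. The function name `prune_tokens` and the `keep_ratio` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def prune_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    keep_ratio: float = 0.5,
) -> Tuple[List[str], List[int]]:
    """Keep the tokens with the highest attention-derived scores.

    Assumption: `scores[i]` is a precomputed importance value for `tokens[i]`;
    how the real model derives it is not specified in this record.
    Returns the kept tokens and their original indices, in original order.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Always keep at least one token so later layers have some input.
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices from most to least important.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore the original spatial order of the surviving tokens.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept], kept


tokens = ["t0", "t1", "t2", "t3"]
scores = [0.10, 0.40, 0.05, 0.45]
kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.5)
```

With these inputs the two lowest-scoring tokens are dropped, so later layers would process half as many tokens. Sorting the kept indices preserves reading order, which matters when the tokens correspond to positions along a text line.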