Single-Character-Based Embedding Feature Aggregation Using Cross-Attention for Scene Text Super-Resolution

Bibliographic Details
Main Authors:	Meng Wang, Qianqian Li, Haipeng Liu
Format:	Article
Language:	English
Published:	MDPI AG 2025-04-01
Series:	Sensors
Subjects:	scene text image super-resolution; cross-attention; cross-fertilization; text recognition
Online Access:	https://www.mdpi.com/1424-8220/25/7/2228

_version_	1849769329748344832
author	Meng Wang, Qianqian Li, Haipeng Liu
author_facet	Meng Wang, Qianqian Li, Haipeng Liu
author_sort	Meng Wang
collection	DOAJ
description	In textual vision scenarios, super-resolution aims to enhance text quality and readability to facilitate downstream tasks. However, the ambiguity of character regions in complex backgrounds remains difficult to mitigate, particularly the interference between tightly connected characters. In this paper, we propose single-character-based embedding feature aggregation using cross-attention for scene text super-resolution (SCE-STISR) to address this problem. First, a dynamic feature extraction mechanism adaptively captures shallow features by adjusting multi-scale feature weights based on spatial representations. During text–image interaction, a dual-level cross-attention mechanism aggregates the cropped single-character features with the textual prior while aligning semantic sequences with visual features. Finally, an adaptive normalized color correction operation mitigates color distortion caused by background interference. On the TextZoom benchmark, the text recognition accuracies obtained with three different recognizers are 53.6%, 60.9%, and 64.5%, an improvement of 0.9–1.4% over the baseline TATT, while achieving an optimal SSIM of 0.7951 and a PSNR of 21.84. Additionally, our approach improves accuracy by 0.2–2.2% over existing baselines on five text recognition datasets, validating the effectiveness of the model.
format	Article
id	doaj-art-91a42cc04b7c46ffab9f03e56f68d0bf
institution	DOAJ
issn	1424-8220
language	English
publishDate	2025-04-01
publisher	MDPI AG
record_format	Article
series	Sensors
spelling	doaj-art-91a42cc04b7c46ffab9f03e56f68d0bf; 2025-08-20T03:03:27Z; eng; MDPI AG; Sensors; 1424-8220; 2025-04-01; 25; 7; 2228; 10.3390/s25072228; Single-Character-Based Embedding Feature Aggregation Using Cross-Attention for Scene Text Super-Resolution; Meng Wang, Qianqian Li, Haipeng Liu (all: School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China); https://www.mdpi.com/1424-8220/25/7/2228; scene text image super-resolution; cross-attention; cross-fertilization; text recognition
spellingShingle	Meng Wang, Qianqian Li, Haipeng Liu; Single-Character-Based Embedding Feature Aggregation Using Cross-Attention for Scene Text Super-Resolution; Sensors; scene text image super-resolution; cross-attention; cross-fertilization; text recognition
title	Single-Character-Based Embedding Feature Aggregation Using Cross-Attention for Scene Text Super-Resolution
title_sort	single character based embedding feature aggregation using cross attention for scene text super resolution
topic	scene text image super-resolution; cross-attention; cross-fertilization; text recognition
url	https://www.mdpi.com/1424-8220/25/7/2228
work_keys_str_mv	AT mengwang singlecharacterbasedembeddingfeatureaggregationusingcrossattentionforscenetextsuperresolution AT qianqianli singlecharacterbasedembeddingfeatureaggregationusingcrossattentionforscenetextsuperresolution AT haipengliu singlecharacterbasedembeddingfeatureaggregationusingcrossattentionforscenetextsuperresolution
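
Illustrative note: the description in this record mentions a cross-attention mechanism that aggregates single-character features with a textual prior. The sketch below is not the authors' SCE-STISR implementation; it only shows generic single-head scaled dot-product cross-attention in NumPy, so readers unfamiliar with the term can see what "aggregating one set of features with another" might look like. All names, dimensions, and the random projection matrices (`char_feats`, `text_prior`, `w_q`, `w_k`, `w_v`) are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Generic single-head cross-attention (illustrative, not from the paper).

    queries: (n_q, d_q) features that ask the questions (here: character features)
    context: (n_c, d_c) features that are attended to (here: textual prior)
    """
    q = queries @ w_q                        # (n_q, d)
    k = context @ w_k                        # (n_c, d)
    v = context @ w_v                        # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_c) scaled similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights              # fused features, attention map


# Hypothetical sizes: 5 cropped character features, 7 text-prior tokens.
rng = np.random.default_rng(0)
d_img, d_txt, d = 32, 16, 8
char_feats = rng.normal(size=(5, d_img))
text_prior = rng.normal(size=(7, d_txt))
w_q = rng.normal(size=(d_img, d))  # random stand-ins for learned projections
w_k = rng.normal(size=(d_txt, d))
w_v = rng.normal(size=(d_txt, d))

fused, attn = cross_attention(char_feats, text_prior, w_q, w_k, w_v)
```

Each row of `attn` is a probability distribution over the text-prior tokens, and each row of `fused` is the corresponding weighted mix of projected text-prior features. A "dual-level" design as named in the description would presumably apply attention at more than one level, but the record does not give enough detail to sketch that here.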
