Correlation-guided decoding strategy for low-resource Uyghur scene text recognition

Abstract: Currently, most state-of-the-art scene text recognition methods are based on the Transformer architecture and rely on pre-trained large language models. However, these pre-trained models are primarily designed for resource-rich languages and exhibit limitations when applied to low-resource languages. We propose a Correlation-Guided Decoding Strategy (CGDS) for low-resource Uyghur scene text recognition. Specifically, (1) CGDS employs a hybrid encoding strategy that combines a Convolutional Neural Network (CNN) with a Transformer. This hybrid encoding leverages the strengths of both: the convolutional structure and weight sharing of the CNN allow efficient extraction of local features, reducing dependence on large datasets and minimizing errors caused by visually similar characters, while the global attention mechanism of the Transformer captures long-distance dependencies, strengthening the informational links between characters and thereby improving recognition accuracy. A dynamic fusion method then integrates the CNN and Transformer features, adaptively allocating their weights during training and achieving a dynamic balance between local and global features. (2) To further enhance feature extraction, we design a Correlation-Guided Decoding (CGD) module. Unlike existing decoding strategies, we adopt a dual-decoder approach with a Transformer decoder and a CGD decoder. The CGD decoder performs correlation calculations between the outputs of the Transformer decoder and the encoder to optimize the final recognition result; at the same time, it uses the Transformer decoder's outputs to provide semantic guidance for the encoder's feature extraction, enabling the model to better understand the semantic structure of the input. This dual-decoder strategy guides the model toward extracting effective features, strengthening its ability to learn internal language knowledge and to exploit the useful information in the input data. (3) We construct two Uyghur scene text datasets, U1 and U2. Experimental results show that our method outperforms existing techniques on low-resource Uyghur scene text recognition: CGDS improves accuracy by 50.2% on U1 and 13.6% on U2, for an overall accuracy improvement of 15.9%.
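The abstract does not give the exact form of the dynamic fusion, so the sketch below is only a guess at one common realization: a learned gate that produces per-channel weights for the CNN and Transformer features. The class name DynamicFusion and its internals are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of the dynamic fusion described in the abstract; a learned
# gate is assumed as one common way to adaptively weight CNN vs. Transformer features.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Gate network: maps the concatenated features to weights in (0, 1).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor) -> torch.Tensor:
        # f_cnn, f_trans: (batch, seq_len, dim) local and global features.
        g = self.gate(torch.cat([f_cnn, f_trans], dim=-1))
        # Convex combination: g near 1 favors local CNN features,
        # g near 0 favors global Transformer features.
        return g * f_cnn + (1.0 - g) * f_trans

Because the gate is computed from the features themselves, the balance between local and global evidence can shift per position and per training step, which matches the abstract's description of adaptive weight allocation.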

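Likewise, the "correlation calculations" of the CGD decoder are not specified in the abstract; a minimal sketch, assuming scaled dot-product correlation between the Transformer decoder's outputs (queries) and the encoder's features (keys/values), could look like the following. The function cgd_correlation is hypothetical.

# Hypothetical sketch of the CGD decoder's correlation step: scaled dot-product
# attention is assumed as one plausible form of the correlation calculation.
import torch
import torch.nn.functional as F

def cgd_correlation(dec_out: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
    # dec_out: (batch, tgt_len, dim) Transformer decoder outputs (queries).
    # enc_out: (batch, src_len, dim) encoder features (keys/values).
    scale = dec_out.size(-1) ** 0.5
    corr = torch.bmm(dec_out, enc_out.transpose(1, 2)) / scale  # (B, tgt, src)
    attn = F.softmax(corr, dim=-1)
    # Re-read encoder features weighted by their correlation with the decoded
    # semantics; this refined representation feeds the final prediction.
    return torch.bmm(attn, enc_out)

Routing the decoder's semantic state back over the encoder features in this way is one mechanism by which the decoded semantics could guide the encoder's feature extraction, as the abstract claims.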
Bibliographic Details
Main Authors: Miaomiao Xu, Jiang Zhang, Lianghui Xu, Wushour Silamu, Yanbing Li (College of Computer Science and Technology, Xinjiang University)
Format: Article
Language: English
Published: Springer, 2024-11-01
Series: Complex & Intelligent Systems
ISSN: 2199-4536, 2198-6053
Subjects: Scene text recognition; Low-resource Uyghur; Correlation-guided decoding strategy; Hybrid encoding
Online Access: https://doi.org/10.1007/s40747-024-01689-5