Correlation-guided decoding strategy for low-resource Uyghur scene text recognition

Abstract: Currently, most state-of-the-art scene text recognition methods are based on the Transformer architecture and rely on pre-trained large language models. However, these pre-trained models are primarily designed for resource-rich languages and exhibit limitations when applied to low-resource languages. We propose a Correlation-Guided Decoding Strategy (CGDS) for low-resource Uyghur scene text recognition. Specifically, (1) CGDS employs a hybrid encoding strategy that combines a Convolutional Neural Network (CNN) with a Transformer. This hybrid encoding leverages the strengths of both: the convolutional structure and weight sharing of the CNN allow efficient extraction of local features, reducing dependence on large datasets and minimizing errors caused by visually similar characters, while the global attention mechanism of the Transformer captures long-distance dependencies, strengthening the informational links between characters and thereby improving recognition accuracy. A dynamic fusion method then integrates the CNN and Transformer features, adaptively allocating their weights during training and achieving a dynamic balance between local and global features. (2) To further enhance feature extraction, we design a Correlation-Guided Decoding (CGD) module. Unlike existing decoding strategies, we adopt a dual-decoder approach with a Transformer decoder and a CGD decoder. The CGD decoder performs correlation calculations between the outputs of the Transformer decoder and the encoder to optimize the final recognition result; at the same time, it uses the Transformer decoder's outputs to provide semantic guidance for the encoder's feature extraction, enabling the model to better understand the semantic structure of the input. This dual-decoder strategy guides the model toward extracting effective features, strengthening its ability to learn internal language knowledge and to exploit the useful information in the input data. (3) We construct two Uyghur scene text datasets, U1 and U2. Experimental results show that our method outperforms existing techniques on low-resource Uyghur scene text recognition: CGDS improves accuracy by 50.2% on U1 and 13.6% on U2, for an overall accuracy improvement of 15.9%.
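The abstract does not give the exact form of the dynamic fusion, so the sketch below is only a guess at one common realization: a learned gate that produces per-channel weights for the CNN and Transformer features. The class name DynamicFusion and its internals are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of the dynamic fusion described in the abstract; a learned
# gate is assumed as one common way to adaptively weight CNN vs. Transformer features.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Gate network: maps the concatenated features to weights in (0, 1).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor) -> torch.Tensor:
        # f_cnn, f_trans: (batch, seq_len, dim) local and global features.
        g = self.gate(torch.cat([f_cnn, f_trans], dim=-1))
        # Convex combination: g near 1 favors local CNN features,
        # g near 0 favors global Transformer features.
        return g * f_cnn + (1.0 - g) * f_trans

Because the gate is computed from the features themselves, the balance between local and global evidence can shift per position and per training step, which matches the abstract's description of adaptive weight allocation.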

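Likewise, the "correlation calculations" of the CGD decoder are not specified in the abstract; a minimal sketch, assuming scaled dot-product correlation between the Transformer decoder's outputs (queries) and the encoder's features (keys/values), could look like the following. The function cgd_correlation is hypothetical.

# Hypothetical sketch of the CGD decoder's correlation step: scaled dot-product
# attention is assumed as one plausible form of the correlation calculation.
import torch
import torch.nn.functional as F

def cgd_correlation(dec_out: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
    # dec_out: (batch, tgt_len, dim) Transformer decoder outputs (queries).
    # enc_out: (batch, src_len, dim) encoder features (keys/values).
    scale = dec_out.size(-1) ** 0.5
    corr = torch.bmm(dec_out, enc_out.transpose(1, 2)) / scale  # (B, tgt, src)
    attn = F.softmax(corr, dim=-1)
    # Re-read encoder features weighted by their correlation with the decoded
    # semantics; this refined representation feeds the final prediction.
    return torch.bmm(attn, enc_out)

Routing the decoder's semantic state back over the encoder features in this way is one mechanism by which the decoded semantics could guide the encoder's feature extraction, as the abstract claims.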
Bibliographic Details
Main Authors: Miaomiao Xu, Jiang Zhang, Lianghui Xu, Wushour Silamu, Yanbing Li (College of Computer Science and Technology, Xinjiang University)
Format: Article
Language: English
Published: Springer, 2024-11-01
Series: Complex & Intelligent Systems
ISSN: 2199-4536, 2198-6053
Subjects: Scene text recognition; Low-resource Uyghur; Correlation-guided decoding strategy; Hybrid encoding
Online Access: https://doi.org/10.1007/s40747-024-01689-5