IMTLM-Net: improved multi-task transformer based on localization mechanism network for handwritten English text recognition


Bibliographic Details
Main Authors: Qianfeng Zhang, Feng Liu, Wanru Song
Format: Article
Language:English
Published: Springer 2025-01-01
Series:Complex & Intelligent Systems
Subjects:
Online Access:https://doi.org/10.1007/s40747-024-01713-8
author Qianfeng Zhang
Feng Liu
Wanru Song
collection DOAJ
description Abstract Intelligent technology has been widely applied to empower education. For example, Optical Character Recognition (OCR) can be used in smart-education scenarios such as online homework correction and teaching-data analysis. One of the fundamental yet challenging tasks is to accurately recognize images of handwritten English text as editable text. Handwritten text reflects diverse writing habits and often contains smearing and overlapping strokes, which makes it difficult to align the image with the ground-truth text. In addition, the scarcity of handwritten-text data further lowers the recognition rate. To address these issues, this paper, on the one hand, extends the existing dataset and introduces annotation of hyphenated text to provide data support for improving the robustness and discriminative ability of the model; on the other hand, it proposes a novel framework named Improved Multi-task Transformer based on Localization Mechanism Network (IMTLM-Net) for handwritten English text recognition. IMTLM-Net consists of two parts, an encoding module and a decoding module. The encoding module adopts a dual-stream processing mechanism: text and images are processed simultaneously, with a Vision Transformer (ViT) encoding the images and a Permutation Language Model (PLM) modeling word arrangements. The decoding module employs two Multi-Head Attention (MHA) units that attend to the text sequence and the image sequence, respectively. Moreover, a localization mechanism (LM) is applied to enhance the extraction of font-structure features from the image data, which in turn improves the model's ability to capture fine details. Extensive experiments demonstrate that the proposed method achieves state-of-the-art results in handwritten text recognition.
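
The following is a minimal, illustrative sketch of the decoder structure described in the abstract: two multi-head attention units, one attending over the (permuted) text sequence from the language-model stream and one over the image-patch sequence from the ViT stream. The use of PyTorch, the module and variable names, and all dimensions are assumptions made here for illustration; this is not the authors' IMTLM-Net implementation, and the localization mechanism is not sketched.

import torch
import torch.nn as nn

class DualStreamDecoderLayer(nn.Module):
    # One decoder layer with two attention units: one over the text-token stream
    # (stand-in for PLM embeddings) and one over the image-patch stream (stand-in
    # for ViT embeddings), mirroring the two MHA units described in the abstract.
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, text_memory, image_memory, text_mask=None):
        # Attend over the text stream first (an optional attention mask could
        # encode a permutation order), then over the image-patch stream.
        t, _ = self.text_attn(queries, text_memory, text_memory, attn_mask=text_mask)
        q = self.norm1(queries + t)
        v, _ = self.image_attn(q, image_memory, image_memory)
        q = self.norm2(q + v)
        return self.norm3(q + self.ffn(q))

# Dummy usage: batch of 2, 100 image patches, 25 character positions, d_model = 256.
layer = DualStreamDecoderLayer()
image_memory = torch.randn(2, 100, 256)   # stand-in for ViT patch embeddings
text_memory = torch.randn(2, 25, 256)     # stand-in for PLM token embeddings
queries = torch.randn(2, 25, 256)         # decoder position queries
out = layer(queries, text_memory, image_memory)
print(out.shape)  # torch.Size([2, 25, 256])

In the actual model, the image memory would come from the ViT encoder applied to the handwritten-text image and the text memory from the permutation language model; the two-attention layout above only illustrates how a decoder can consult both streams.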
format Article
id doaj-art-b2e1ad69088943cead926406efc53a21
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-01-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
affiliation Qianfeng Zhang: School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications
Feng Liu: School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications
Wanru Song: School of Educational Science and Technology, Nanjing University of Posts and Telecommunications
title IMTLM-Net: improved multi-task transformer based on localization mechanism network for handwritten English text recognition
topic Handwritten English text recognition
English composition dataset
Transformer
Local feature extraction
url https://doi.org/10.1007/s40747-024-01713-8