IMTLM-Net: improved multi-task transformer based on localization mechanism network for handwritten English text recognition


Bibliographic Details
Main Authors: Qianfeng Zhang, Feng Liu, Wanru Song
Format: Article
Language:English
Published: Springer 2025-01-01
Series:Complex & Intelligent Systems
Subjects:
Online Access:https://doi.org/10.1007/s40747-024-01713-8
author Qianfeng Zhang
Feng Liu
Wanru Song
collection DOAJ
description Abstract Intelligent technology has been widely applied to empower education. For example, Optical Character Recognition (OCR) can be used in smart-education scenarios such as online homework correction and teaching-data analysis. One of the fundamental yet challenging tasks is to accurately recognize images of handwritten English text as editable text. Handwritten text reflects diverse writing habits and often contains smearing and overlapping strokes, which makes it difficult to align the image with the ground-truth text. In addition, the scarcity of handwritten-text data further lowers the recognition rate. To address these issues, this paper, on the one hand, extends the existing dataset and introduces annotation of hyphenated text to provide data support for improving the robustness and discriminative ability of the model; on the other hand, it proposes a novel framework named Improved Multi-task Transformer based on Localization Mechanism Network (IMTLM-Net) for handwritten English text recognition. IMTLM-Net consists of two parts, an encoding module and a decoding module. The encoding module adopts a dual-stream processing mechanism: text and images are processed simultaneously, with a Vision Transformer (ViT) encoding the images and a Permutation Language Model (PLM) modeling word arrangements. The decoding module employs two Multi-Head Attention (MHA) units that attend to the text sequence and the image sequence, respectively. Moreover, a localization mechanism (LM) is applied to enhance the extraction of font-structure features from the image data, which in turn improves the model's ability to capture fine details. Extensive experiments demonstrate that the proposed method achieves state-of-the-art results in handwritten text recognition.
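
The following is a minimal, illustrative sketch of the decoder structure described in the abstract: two multi-head attention units, one attending over the (permuted) text sequence from the language-model stream and one over the image-patch sequence from the ViT stream. The use of PyTorch, the module and variable names, and all dimensions are assumptions made here for illustration; this is not the authors' IMTLM-Net implementation, and the localization mechanism is not sketched.

import torch
import torch.nn as nn

class DualStreamDecoderLayer(nn.Module):
    # One decoder layer with two attention units: one over the text-token stream
    # (stand-in for PLM embeddings) and one over the image-patch stream (stand-in
    # for ViT embeddings), mirroring the two MHA units described in the abstract.
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, text_memory, image_memory, text_mask=None):
        # Attend over the text stream first (an optional attention mask could
        # encode a permutation order), then over the image-patch stream.
        t, _ = self.text_attn(queries, text_memory, text_memory, attn_mask=text_mask)
        q = self.norm1(queries + t)
        v, _ = self.image_attn(q, image_memory, image_memory)
        q = self.norm2(q + v)
        return self.norm3(q + self.ffn(q))

# Dummy usage: batch of 2, 100 image patches, 25 character positions, d_model = 256.
layer = DualStreamDecoderLayer()
image_memory = torch.randn(2, 100, 256)   # stand-in for ViT patch embeddings
text_memory = torch.randn(2, 25, 256)     # stand-in for PLM token embeddings
queries = torch.randn(2, 25, 256)         # decoder position queries
out = layer(queries, text_memory, image_memory)
print(out.shape)  # torch.Size([2, 25, 256])

In the actual model, the image memory would come from the ViT encoder applied to the handwritten-text image and the text memory from the permutation language model; the two-attention layout above only illustrates how a decoder can consult both streams.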
format Article
id doaj-art-b2e1ad69088943cead926406efc53a21
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-01-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
affiliation Qianfeng Zhang: School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications
Feng Liu: School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications
Wanru Song: School of Educational Science and Technology, Nanjing University of Posts and Telecommunications
title IMTLM-Net: improved multi-task transformer based on localization mechanism network for handwritten English text recognition
topic Handwritten English text recognition
English composition dataset
Transformer
Local feature extraction
url https://doi.org/10.1007/s40747-024-01713-8