Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task


Bibliographic Details
Main Authors: Osman Furkan KARAKUŞ, Ayla GULCU, Ali Can KARACA
Format: Article
Language:English
Published: Bursa Technical University 2025-06-01
Series:Journal of Innovative Science and Engineering
Subjects:
Online Access:http://jise.btu.edu.tr/en/download/article-file/3874269
_version_ 1849251738438074368
author Osman Furkan KARAKUŞ
Ayla GULCU
Ali Can KARACA
author_facet Osman Furkan KARAKUŞ
Ayla GULCU
Ali Can KARACA
author_sort Osman Furkan KARAKUŞ
collection DOAJ
description This study introduces a novel approach for segmenting lines of text in handwritten documents using a vision transformer model. Specifically, we adapt the DEtection TRansformer (DETR) model to detect text lines in images of handwritten documents. To adapt DETR to the line segmentation task, we apply a pre-processing step that divides each line into fixed-size image patches and adds positional encoding. We use a DETR model with a ResNet-101 backbone pretrained on the Common Objects in Context (COCO) object detection dataset, and re-train this model on our novel, complex line segmentation dataset of 1,610 handwritten forms. To evaluate performance, we also implement another line segmentation method, Bangla Document Recognition through Instance-level Segmentation of Handwritten Text Images (BN-DRISHTI), which uses the You Only Look Once (YOLO) object detection model. Both object detection-based methods involve a learning phase during which the model is trained or fine-tuned on the dataset. For a more diverse set of baselines, we also implement two learning-free algorithms: the A* search algorithm and the genetic algorithm (GA). Experimental results based on the Intersection over Union (IoU) metric demonstrate that the proposed method outperforms all other methods in terms of detection rate, recognition accuracy, and the Text Line Detection Metric (TLDM). The quantitative results also indicate that the two learning-free algorithms fail to segment highly skewed lines in the dataset. The A* algorithm achieves a recognition accuracy of 0.734, compared with 0.498 for GA and 0.689 for BN-DRISHTI. Our proposed approach achieves the highest recognition accuracy of 0.872, outperforming all other methods.
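The patch-based pre-processing described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the 16×16 patch size and the sinusoidal positional encoding are assumptions chosen to mirror standard vision transformer practice.

```python
import numpy as np

def split_into_patches(line_img: np.ndarray, patch: int = 16) -> np.ndarray:
    """Divide a (H, W) grayscale line image into fixed-size patches, padding
    the right/bottom edges with white (255) so H and W become multiples of
    `patch`. Returns one flattened row per patch."""
    h, w = line_img.shape
    pad_h = (-h) % patch
    pad_w = (-w) % patch
    padded = np.pad(line_img, ((0, pad_h), (0, pad_w)), constant_values=255)
    ph, pw = padded.shape[0] // patch, padded.shape[1] // patch
    patches = padded.reshape(ph, patch, pw, patch).swapaxes(1, 2)
    return patches.reshape(ph * pw, patch * patch)

def sinusoidal_positions(n_patches: int, dim: int) -> np.ndarray:
    """Standard 1-D sine/cosine positional encoding, one row per patch."""
    pos = np.arange(n_patches)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

line = np.full((40, 300), 255, dtype=np.uint8)   # dummy 40x300 line image
patches = split_into_patches(line)               # 3 patch rows x 19 patch cols
pos = sinusoidal_positions(patches.shape[0], patches.shape[1])
tokens = patches.astype(np.float64) + pos        # patch values + positions
```

In a real pipeline the flattened patches would be linearly projected to the transformer's embedding dimension before the positional encoding is added; the sketch adds it directly to the raw pixels only to keep the example short.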
We show that the DETR model, which requires only a single fine-tuning phase to adapt to the line segmentation task, not only simplifies the training and implementation process but also improves accuracy and efficiency in detecting and segmenting handwritten text lines. DETR’s transformer-based global attention mechanism allows it to capture the entire context of an image rather than relying solely on local features. This is particularly beneficial for handling the diverse and complex patterns found in handwritten text, where traditional models may struggle with issues such as overlapping text lines or varied handwriting styles.
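The IoU-based evaluation the abstract refers to can be sketched as below. The box format (x1, y1, x2, y2), the 0.5 match threshold, and the greedy one-to-one matching are assumptions for illustration, with detection rate taken as matched lines over ground-truth lines and recognition accuracy as matched lines over detected lines; the paper's exact TLDM definition is not reproduced here.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_lines(gt, pred, thr=0.5):
    """Greedily match each ground-truth line box to its best unused
    prediction; returns (detection_rate, recognition_accuracy)."""
    used, matches = set(), 0
    for g in gt:
        best, best_j = 0.0, None
        for j, p in enumerate(pred):
            score = iou(g, p)
            if j not in used and score > best:
                best, best_j = score, j
        if best_j is not None and best >= thr:
            used.add(best_j)
            matches += 1
    return matches / len(gt), matches / len(pred)

gt = [(0, 0, 100, 20), (0, 25, 100, 45)]
pred = [(0, 2, 100, 22), (0, 60, 100, 80)]
dr, ra = match_lines(gt, pred)  # first line matched, second missed
```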
format Article
id doaj-art-c8f68bc65c214ca093668092136a9c85
institution Kabale University
issn 2602-4217
language English
publishDate 2025-06-01
publisher Bursa Technical University
record_format Article
series Journal of Innovative Science and Engineering
spelling doaj-art-c8f68bc65c214ca093668092136a9c85
indexed: 2025-08-20T03:56:50Z
language: English
publisher: Bursa Technical University
series: Journal of Innovative Science and Engineering (ISSN 2602-4217)
published: 2025-06-01, vol. 9, no. 1, pp. 28-38
doi: 10.38088/jise.1471047
title: Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
authors: Osman Furkan KARAKUŞ (Yıldız Technical University, ORCID 0000-0003-3017-7715); Ayla GULCU (Bahcesehir University, ORCID 0000-0003-3258-8681); Ali Can KARACA (Yıldız Technical University, ORCID 0000-0002-6835-7634)
url: http://jise.btu.edu.tr/en/download/article-file/3874269
keywords: vision transformers; handwritten text line segmentation; object detection; optical character recognition
spellingShingle Osman Furkan KARAKUŞ
Ayla GULCU
Ali Can KARACA
Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
Journal of Innovative Science and Engineering
vision transformers
handwritten text line segmentation
object detection
optical character recognition
title Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_full Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_fullStr Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_full_unstemmed Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_short Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_sort adapting vision transformer based object detection model for handwritten text line segmentation task
topic vision transformers
handwritten text line segmentation
object detection
optical character recognition
url http://jise.btu.edu.tr/en/download/article-file/3874269
work_keys_str_mv AT osmanfurkankarakus adaptingvisiontransformerbasedobjectdetectionmodelforhandwrittentextlinesegmentationtask
AT aylagulcu adaptingvisiontransformerbasedobjectdetectionmodelforhandwrittentextlinesegmentationtask
AT alicankaraca adaptingvisiontransformerbasedobjectdetectionmodelforhandwrittentextlinesegmentationtask