Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task


Bibliographic Details
Main Authors: Osman Furkan KARAKUŞ, Ayla GULCU, Ali Can KARACA
Format: Article
Language:English
Published: Bursa Technical University 2025-06-01
Series:Journal of Innovative Science and Engineering
Subjects:
Online Access:http://jise.btu.edu.tr/en/download/article-file/3874269
_version_ 1849251738438074368
author Osman Furkan KARAKUŞ
Ayla GULCU
Ali Can KARACA
author_facet Osman Furkan KARAKUŞ
Ayla GULCU
Ali Can KARACA
author_sort Osman Furkan KARAKUŞ
collection DOAJ
description This study introduces a novel approach for segmenting lines of text in handwritten documents using a vision transformer model. Specifically, we adapt the DEtection TRansformer (DETR) model to detect text lines in images of handwritten documents. To adapt DETR to the line segmentation task, we apply a pre-processing step that divides each line into fixed-size image patches and adds positional encoding. We use a DETR model with a ResNet-101 backbone pretrained on the Common Objects in Context (COCO) object detection dataset, and re-train this model on our novel, complex line segmentation dataset of 1,610 handwritten forms. To evaluate performance, we also implement another line segmentation method, Bangla Document Recognition through Instance-level Segmentation of Handwritten Text Images (BN-DRISHTI), which uses the You Only Look Once (YOLO) object detection model. Both object detection-based methods involve a learning phase during which the model is trained or fine-tuned on the dataset. For a more diverse set of baselines, we also implement two learning-free algorithms: the A* search algorithm and the genetic algorithm (GA). Experimental results based on the Intersection over Union (IoU) metric demonstrate that the proposed method outperforms all other methods in terms of detection rate, recognition accuracy, and the Text Line Detection Metric (TLDM). The quantitative results also indicate that the two learning-free algorithms fail to segment highly skewed lines in the dataset. The A* algorithm achieves a recognition accuracy of 0.734, compared with 0.498 for GA and 0.689 for BN-DRISHTI. Our proposed approach achieves the highest recognition accuracy of 0.872, outperforming all other methods.
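The patch-based pre-processing described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the 16×16 patch size and the sinusoidal positional encoding are assumptions chosen to mirror standard vision transformer practice.

```python
import numpy as np

def split_into_patches(line_img: np.ndarray, patch: int = 16) -> np.ndarray:
    """Divide a (H, W) grayscale line image into fixed-size patches, padding
    the right/bottom edges with white (255) so H and W become multiples of
    `patch`. Returns one flattened row per patch."""
    h, w = line_img.shape
    pad_h = (-h) % patch
    pad_w = (-w) % patch
    padded = np.pad(line_img, ((0, pad_h), (0, pad_w)), constant_values=255)
    ph, pw = padded.shape[0] // patch, padded.shape[1] // patch
    patches = padded.reshape(ph, patch, pw, patch).swapaxes(1, 2)
    return patches.reshape(ph * pw, patch * patch)

def sinusoidal_positions(n_patches: int, dim: int) -> np.ndarray:
    """Standard 1-D sine/cosine positional encoding, one row per patch."""
    pos = np.arange(n_patches)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

line = np.full((40, 300), 255, dtype=np.uint8)   # dummy 40x300 line image
patches = split_into_patches(line)               # 3 patch rows x 19 patch cols
pos = sinusoidal_positions(patches.shape[0], patches.shape[1])
tokens = patches.astype(np.float64) + pos        # patch values + positions
```

In a real pipeline the flattened patches would be linearly projected to the transformer's embedding dimension before the positional encoding is added; the sketch adds it directly to the raw pixels only to keep the example short.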
We show that the DETR model, which requires only a single fine-tuning phase to adapt to the line segmentation task, not only simplifies the training and implementation process but also improves accuracy and efficiency in detecting and segmenting handwritten text lines. DETR’s transformer-based global attention mechanism allows it to capture the entire context of an image rather than relying solely on local features. This is particularly beneficial for handling the diverse and complex patterns found in handwritten text, where traditional models may struggle with issues such as overlapping text lines or varied handwriting styles.
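The IoU-based evaluation the abstract refers to can be sketched as below. The box format (x1, y1, x2, y2), the 0.5 match threshold, and the greedy one-to-one matching are assumptions for illustration, with detection rate taken as matched lines over ground-truth lines and recognition accuracy as matched lines over detected lines; the paper's exact TLDM definition is not reproduced here.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_lines(gt, pred, thr=0.5):
    """Greedily match each ground-truth line box to its best unused
    prediction; returns (detection_rate, recognition_accuracy)."""
    used, matches = set(), 0
    for g in gt:
        best, best_j = 0.0, None
        for j, p in enumerate(pred):
            score = iou(g, p)
            if j not in used and score > best:
                best, best_j = score, j
        if best_j is not None and best >= thr:
            used.add(best_j)
            matches += 1
    return matches / len(gt), matches / len(pred)

gt = [(0, 0, 100, 20), (0, 25, 100, 45)]
pred = [(0, 2, 100, 22), (0, 60, 100, 80)]
dr, ra = match_lines(gt, pred)  # first line matched, second missed
```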
format Article
id doaj-art-c8f68bc65c214ca093668092136a9c85
institution Kabale University
issn 2602-4217
language English
publishDate 2025-06-01
publisher Bursa Technical University
record_format Article
series Journal of Innovative Science and Engineering
spelling doaj-art-c8f68bc65c214ca093668092136a9c85
indexed: 2025-08-20T03:56:50Z
language: English
publisher: Bursa Technical University
series: Journal of Innovative Science and Engineering (ISSN 2602-4217)
published: 2025-06-01, vol. 9, no. 1, pp. 28-38
doi: 10.38088/jise.1471047
title: Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
authors: Osman Furkan KARAKUŞ (Yıldız Technical University, ORCID 0000-0003-3017-7715); Ayla GULCU (Bahcesehir University, ORCID 0000-0003-3258-8681); Ali Can KARACA (Yıldız Technical University, ORCID 0000-0002-6835-7634)
url: http://jise.btu.edu.tr/en/download/article-file/3874269
keywords: vision transformers; handwritten text line segmentation; object detection; optical character recognition
spellingShingle Osman Furkan KARAKUŞ
Ayla GULCU
Ali Can KARACA
Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
Journal of Innovative Science and Engineering
vision transformers
handwritten text line segmentation
object detection
optical character recognition
title Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_full Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_fullStr Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_full_unstemmed Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_short Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
title_sort adapting vision transformer based object detection model for handwritten text line segmentation task
topic vision transformers
handwritten text line segmentation
object detection
optical character recognition
url http://jise.btu.edu.tr/en/download/article-file/3874269
work_keys_str_mv AT osmanfurkankarakus adaptingvisiontransformerbasedobjectdetectionmodelforhandwrittentextlinesegmentationtask
AT aylagulcu adaptingvisiontransformerbasedobjectdetectionmodelforhandwrittentextlinesegmentationtask
AT alicankaraca adaptingvisiontransformerbasedobjectdetectionmodelforhandwrittentextlinesegmentationtask