Enhanced OCR Recognition for Madurese Text Documents: A Genetic Algorithm Approach with Tesseract 5.5

Bibliographic Details
Main Authors: Muhammad Nazir Arifin, Muhammad Umar Mansyur, Ali Rahman, Nindian Puspa Dewi, Fauzan Prasetyo Eka Putra
Format: Article
Language: Indonesian
Published: Universitas Muhammadiyah Purwokerto 2025-08-01
Series: Jurnal Informatika
Subjects:
Online Access: http://jurnalnasional.ump.ac.id/index.php/JUITA/article/view/25794
collection DOAJ
description This study applies Genetic Algorithms (GA) to Optical Character Recognition (OCR) for the Madurese language. It addresses the challenges of processing Madurese text documents by implementing a nine-step image preprocessing workflow optimized through GA. The methodology combines rescaling, grayscale conversion, adaptive thresholding, deskewing, median blur, Otsu thresholding, border removal, contrast enhancement, and noise reduction, with the order of the steps determined by GA optimization. The system uses the Tesseract 5.5 OCR engine configured with Vietnamese language model parameters to accommodate Madurese writing characteristics. In experiments on a dataset of 500 images, the GA-optimized preprocessing sequence achieved a 24.32% Word Error Rate (WER) and a 7.47% Character Error Rate (CER), a substantial improvement over the baseline Tesseract implementation. Further optimization through language model selection, particularly using the Occitan (OCI) model, yielded 100% accuracy in specific test cases. The research also explored several fitness function configurations, of which a 0.7:0.3 WER-to-CER ratio proved most effective. These results demonstrate the potential of GA optimization to enhance OCR performance for regional languages with unique characteristics, contributing to the broader field of document digitization and language preservation.
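The abstract names the ingredients of the optimization (a chromosome that orders nine preprocessing steps, and a fitness that weights WER and CER 0.7:0.3) without giving formulas. The following is a minimal Python sketch of how those pieces could fit together, not the authors' implementation. It assumes the fitness is a weighted sum to be minimized, that WER and CER are edit distances normalized by reference length, and that chromosomes are permutations of the step names; the function names, the crossover variant, and the swap mutation are all hypothetical choices made for illustration.

```python
import random

# The nine preprocessing steps listed in the abstract; the identifiers are illustrative.
STEPS = ["rescale", "grayscale", "adaptive_threshold", "deskew", "median_blur",
         "otsu_threshold", "border_removal", "contrast_enhancement", "noise_reduction"]

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)

def fitness(ref, hyp, w_wer=0.7, w_cer=0.3):
    """Assumed fitness: weighted sum of WER and CER (lower is better)."""
    return w_wer * wer(ref, hyp) + w_cer * cer(ref, hyp)

def swap_mutation(order, rng):
    """Return a copy of the step order with two positions exchanged."""
    i, j = rng.sample(range(len(order)), 2)
    child = list(order)
    child[i], child[j] = child[j], child[i]
    return child

def order_crossover(parent_a, parent_b, rng):
    """Keep a slice of parent_a in place; fill the rest in parent_b's order."""
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    kept = parent_a[i:j]
    rest = [step for step in parent_b if step not in kept]
    return rest[:i] + kept + rest[i:]
```

In a full GA loop, each candidate order would be applied to the images, the result passed to the OCR engine, and the recognized text scored with `fitness` against the ground truth; the crossover and mutation operators keep every child a valid permutation of the nine steps.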
id doaj-art-33ae505c2a52458d849ae23e02ab7d5b
institution DOAJ
issn 2086-9398
2579-8901
spelling doaj-art-33ae505c2a52458d849ae23e02ab7d5b; 2025-08-20T02:57:08Z; ind; Universitas Muhammadiyah Purwokerto; Jurnal Informatika; 2086-9398; 2579-8901; 2025-08-01; 10911810.30595/juita.v13i2.2579420844
topic image preprocessing
optical character recognition
genetic algorithm optimization
madurese language processing
tesseract ocr