Enhanced OCR Recognition for Madurese Text Documents: A Genetic Algorithm Approach with Tesseract 5.5
This study applies Genetic Algorithms (GA) to improve Optical Character Recognition (OCR) for the Madurese language. It addresses the challenges of processing Madurese text documents by implementing a nine-step image preprocessing workflow optimized through GA. The methodology combines rescaling, grayscale conversion, adapt...
| Main Authors: | Muhammad Nazir Arifin, Muhammad Umar Mansyur, Ali Rahman, Nindian Puspa Dewi, Fauzan Prasetyo Eka Putra |
|---|---|
| Format: | Article |
| Language: | Indonesian |
| Published: | Universitas Muhammadiyah Purwokerto, 2025-08-01 |
| Series: | Jurnal Informatika |
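| Subjects: | image preprocessing; optical character recognition; genetic algorithm optimization; Madurese language processing; Tesseract OCR |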
| Online Access: | http://jurnalnasional.ump.ac.id/index.php/JUITA/article/view/25794 |
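
The abstract describes a GA that searches over the order of nine preprocessing steps. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a permutation encoding, order crossover, swap mutation, and elitism, none of which the record specifies. The step labels are shorthand for the steps named in the abstract, and `toy_fitness` is a hypothetical placeholder; in practice the fitness would apply the steps to each image, run OCR, and return an error score.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Shorthand labels for the nine steps listed in the abstract.
STEPS = ["rescale", "grayscale", "adaptive_threshold", "deskew", "median_blur",
         "otsu_threshold", "border_removal", "contrast_enhancement", "noise_reduction"]


def order_crossover(a: Sequence[str], b: Sequence[str], rng: random.Random) -> List[str]:
    """Keep a random slice of parent a in place; fill the rest in parent b's order."""
    i, j = sorted(rng.sample(range(len(a)), 2))
    middle = list(a[i:j + 1])
    rest = [s for s in b if s not in middle]
    return rest[:i] + middle + rest[i:]


def swap_mutation(seq: Sequence[str], rng: random.Random, rate: float) -> List[str]:
    """With probability `rate`, swap two positions so the result stays a permutation."""
    seq = list(seq)
    if rng.random() < rate:
        i, j = rng.sample(range(len(seq)), 2)
        seq[i], seq[j] = seq[j], seq[i]
    return seq


def evolve(fitness: Callable[[Sequence[str]], float],
           steps: Sequence[str] = STEPS,
           pop_size: int = 20,
           generations: int = 30,
           mutation_rate: float = 0.2,
           seed: int = 0) -> Tuple[List[str], float]:
    """Minimize `fitness` over orderings of `steps` (lower score = better)."""
    rng = random.Random(seed)
    population = [rng.sample(list(steps), len(steps)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0]]                 # elitism: carry the best forward
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = order_crossover(a, b, rng)
            children.append(swap_mutation(child, rng, mutation_rate))
        population = children
    best = min(population, key=fitness)
    return best, fitness(best)


def toy_fitness(order: Sequence[str]) -> float:
    # Placeholder only: counts steps out of place relative to an arbitrary
    # target order. A real fitness would run the pipeline plus OCR.
    return float(sum(1 for got, want in zip(order, STEPS) if got != want))


if __name__ == "__main__":
    best_order, score = evolve(toy_fitness, seed=1)
    print(score, best_order)
```

Population size, generation count, and mutation rate here are illustrative defaults, not values reported by the study.
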
| _version_ | 1850036441657114624 |
|---|---|
| author | Muhammad Nazir Arifin; Muhammad Umar Mansyur; Ali Rahman; Nindian Puspa Dewi; Fauzan Prasetyo Eka Putra |
| author_sort | Muhammad Nazir Arifin |
| collection | DOAJ |
| description | This study applies Genetic Algorithms (GA) to improve Optical Character Recognition (OCR) for the Madurese language. It addresses the challenges of processing Madurese text documents by implementing a nine-step image preprocessing workflow optimized through GA. The methodology combines rescaling, grayscale conversion, adaptive thresholding, deskewing, median blur, Otsu thresholding, border removal, contrast enhancement, and noise reduction, with the sequence determined by GA optimization. The system uses the Tesseract 5.5 OCR engine configured with Vietnamese language model parameters to accommodate Madurese writing characteristics. Experiments on a dataset of 500 images demonstrated significant improvements in recognition accuracy. The GA-optimized preprocessing sequence achieved a 24.32% Word Error Rate (WER) and a 7.47% Character Error Rate (CER), substantial improvements over the baseline Tesseract implementation. Further optimization through language model selection, particularly the Occitan (OCI) model, yielded 100% accuracy in specific test cases. The research also explored several fitness function configurations, with a 0.7:0.3 WER-to-CER weighting proving most effective. These results demonstrate the potential of GA optimization for enhancing OCR performance on regional languages with unique characteristics, contributing to the broader field of document digitization and language preservation. |
| format | Article |
| id | doaj-art-33ae505c2a52458d849ae23e02ab7d5b |
| institution | DOAJ |
| issn | 2086-9398 2579-8901 |
| language | Indonesian |
| publishDate | 2025-08-01 |
| publisher | Universitas Muhammadiyah Purwokerto |
| record_format | Article |
| series | Jurnal Informatika |
| spelling | doaj-art-33ae505c2a52458d849ae23e02ab7d5b; 2025-08-20T02:57:08Z; ind; Universitas Muhammadiyah Purwokerto; Jurnal Informatika; 2086-9398; 2579-8901; 2025-08-01; 10911810.30595/juita.v13i2.2579420844; Enhanced OCR Recognition for Madurese Text Documents: A Genetic Algorithm Approach with Tesseract 5.5; Muhammad Nazir Arifin; Muhammad Umar Mansyur; Ali Rahman; Nindian Puspa Dewi; Fauzan Prasetyo Eka Putra; This study applies Genetic Algorithms (GA) to improve Optical Character Recognition (OCR) for the Madurese language. It addresses the challenges of processing Madurese text documents by implementing a nine-step image preprocessing workflow optimized through GA. The methodology combines rescaling, grayscale conversion, adaptive thresholding, deskewing, median blur, Otsu thresholding, border removal, contrast enhancement, and noise reduction, with the sequence determined by GA optimization. The system uses the Tesseract 5.5 OCR engine configured with Vietnamese language model parameters to accommodate Madurese writing characteristics. Experiments on a dataset of 500 images demonstrated significant improvements in recognition accuracy. The GA-optimized preprocessing sequence achieved a 24.32% Word Error Rate (WER) and a 7.47% Character Error Rate (CER), substantial improvements over the baseline Tesseract implementation. Further optimization through language model selection, particularly the Occitan (OCI) model, yielded 100% accuracy in specific test cases. The research also explored several fitness function configurations, with a 0.7:0.3 WER-to-CER weighting proving most effective. These results demonstrate the potential of GA optimization for enhancing OCR performance on regional languages with unique characteristics, contributing to the broader field of document digitization and language preservation.; http://jurnalnasional.ump.ac.id/index.php/JUITA/article/view/25794; image preprocessing; optical character recognition; genetic algorithm optimization; madurese language processing; tesseract ocr |
| title | Enhanced OCR Recognition for Madurese Text Documents: A Genetic Algorithm Approach with Tesseract 5.5 |
| topic | image preprocessing; optical character recognition; genetic algorithm optimization; madurese language processing; tesseract ocr |
| url | http://jurnalnasional.ump.ac.id/index.php/JUITA/article/view/25794 |
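
The description reports WER and CER and a 0.7:0.3 WER-to-CER fitness weighting. As a rough illustration, the sketch below computes both rates from Levenshtein edit distance and combines them as a weighted sum. The record does not state the exact fitness formula, so the weighted-sum form, the whitespace tokenization, and the function names are assumptions made for this example.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def fitness(reference: str, hypothesis: str,
            w_wer: float = 0.7, w_cer: float = 0.3) -> float:
    """Assumed form: weighted sum of WER and CER, where lower is better."""
    return w_wer * wer(reference, hypothesis) + w_cer * cer(reference, hypothesis)


if __name__ == "__main__":
    # Placeholder strings, not data from the study.
    print(fitness("a b c d", "a x c d"))
```

Weighting WER more heavily than CER penalizes outputs in which a single wrong character spoils a whole word, which is one plausible reason such a ratio could work well.
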