Leveraging large language models for spelling correction in Turkish

The field of natural language processing (NLP) has progressed rapidly, particularly with the rise of large language models (LLMs), which deepen our understanding of the intrinsic structures of languages in a cross-linguistic manner for complex NLP tasks. However, misspellings commonly encountered in human-written text degrade LLMs' language understanding across various NLP tasks, as well as spelling-sensitive applications such as auto-proofreading and chatbots. This study therefore focuses on spelling correction in Turkish, an agglutinative language whose morphology makes the task significantly more challenging. To address this, the study introduces a novel dataset, referred to as NoisyWikiTr, and uses it to evaluate encoder-only models based on bidirectional encoder representations from transformers (BERT) alongside existing auto-correction tools. For the first time in this study, as far as is known, encoder-only BERT-based models are presented as subword prediction models, and encoder-decoder models based on the text-to-text transfer transformer (T5) architecture are fine-tuned for this task in Turkish. A comprehensive comparison of these models highlights the advantages of context-based approaches over traditional, context-free auto-correction tools. The findings also reveal that among LLMs, a language-specific sequence-to-sequence model outperforms both cross-lingual sequence-to-sequence models and encoder-only models in handling realistic misspellings.
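To make the abstract's first approach concrete, below is a minimal sketch of masked-language-modeling-based correction: a suspect word is replaced with the mask token and a Turkish BERT predicts the subword that best fits the context. This is an illustration of the general technique only, not the paper's pipeline; the checkpoint name `dbmdz/bert-base-turkish-cased`, the example sentence, and the single-mask strategy are all assumptions.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed Turkish BERT checkpoint; the paper's exact model is not named here.
MODEL_NAME = "dbmdz/bert-base-turkish-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def suggest_corrections(sentence: str, suspect: str, top_k: int = 5) -> list[str]:
    """Replace `suspect` with [MASK] and return the model's top guesses.

    Simplification: the misspelled word is covered by a single mask token,
    so targets that span several subwords are not handled in this sketch.
    """
    masked = sentence.replace(suspect, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Locate the [MASK] position in the encoded sequence.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
    top_ids = logits[0, mask_pos].topk(top_k).indices.tolist()
    return [tokenizer.decode([i]).strip() for i in top_ids]

# "kitbı" is a plausible misspelling of "kitabı" ("the book", accusative).
print(suggest_corrections("Dün aldığım kitbı çantama koydum.", "kitbı"))
```

In practice a detector must first flag which word is suspect; the sketch assumes the misspelled word is already known.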
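The abstract's second approach, the sequence-to-sequence alternative, maps a noisy sentence directly to its corrected form with an encoder-decoder model. The sketch below shows that interface under stated assumptions: `google/mt5-small` is a stand-in cross-lingual T5 checkpoint, not the paper's model, and an off-the-shelf checkpoint would need fine-tuning on noisy-to-clean pairs (NoisyWikiTr-style data) before it corrects anything.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder cross-lingual T5 checkpoint; the paper's configuration may differ.
MODEL_NAME = "google/mt5-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def correct(noisy_sentence: str) -> str:
    """Decode a corrected sentence from a noisy input (after fine-tuning)."""
    inputs = tokenizer(noisy_sentence, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(correct("Dün aldığım kitbı çantama koydum."))
```

The abstract's finding that a language-specific sequence-to-sequence model beats cross-lingual ones suggests a Turkish-pretrained T5 variant would be the stronger starting point than the multilingual stand-in used here.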

Bibliographic Details
Main Author: Ceren Guzel Turhan (Department of Computer Engineering, Gazi University, Ankara, Turkey)
Format: Article
Language: English
Published: PeerJ Inc., 2025-06-01
Series: PeerJ Computer Science
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.2889
Subjects: Masked language modeling; Encoder-only LLMs; Encoder-decoder sequence-to-sequence models; Spell check; Spell correction
Online Access: https://peerj.com/articles/cs-2889.pdf