Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction


Bibliographic Details
Main Authors: Ahlam Alrehili, Areej Alhothali
Format: Article
Language: English
Published: PeerJ Inc. 2025-03-01
Series: PeerJ Computer Science
Subjects: Arabic grammatical error correction, GEC, Corpus, ChatGPT, NLP, AraGEC
Online Access: https://peerj.com/articles/cs-2724.pdf
author Ahlam Alrehili
Areej Alhothali
collection DOAJ
description In natural language processing (NLP), text data augmentation is used to overcome sample size constraints. Scarce and low-quality data make learning in low-resource domains particularly challenging, and increasing the sample size is a natural and widely used strategy for alleviating these challenges. Data-augmentation techniques are also commonly used in languages with rich data resources to address problems such as exposure bias. In this study, we chose Arabic as the language in which to increase the sample size for grammatical error correction (GEC). Arabic is considered a limited-resource language for GEC, despite being widely used among Arabs and non-Arabs because of its close connection to Islam. This study therefore aims to develop an Arabic corpus called “Tibyan” for grammatical error correction using ChatGPT. ChatGPT is used as a data augmentation tool, guided by pairs of Arabic sentences, called guide sentences, in which a sentence containing grammatical errors is matched with an error-free sentence extracted from Arabic books. Establishing the corpus involved multiple steps. First, we collected and pre-processed pairs of Arabic texts from various sources, such as books and open-access corpora. We then used ChatGPT to generate a parallel corpus, using the previously collected text as a guide for generating sentences with multiple types of errors. Linguistic experts reviewed and validated the automatically generated sentences to ensure they were correct, and the corpus was refined iteratively based on their feedback to improve its accuracy. Finally, we used the Arabic Error Type Annotation tool (ARETA) to analyze the types of errors in the Tibyan corpus. Errors account for 49% of the corpus and span seven types: orthography, morphology, syntax, semantics, punctuation, merge, and split. The Tibyan corpus contains approximately 600K tokens.
format Article
id doaj-art-7d0bfeb2c46146ffa2ce81243513627b
institution DOAJ
issn 2376-5992
language English
publishDate 2025-03-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
spelling doaj-art-7d0bfeb2c46146ffa2ce81243513627b; 2025-08-20T03:04:26Z; eng; PeerJ Inc.; PeerJ Computer Science; ISSN 2376-5992; 2025-03-01; vol. 11; e2724; DOI 10.7717/peerj-cs.2724
Ahlam Alrehili, Areej Alhothali: Department of Computer Sciences, Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia
title Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction
topic Arabic grammatical error correction
GEC
Corpus
ChatGPT
NLP
AraGEC
url https://peerj.com/articles/cs-2724.pdf