Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction

Natural language processing (NLP) augments text data to overcome sample size constraints. Scarce and low-quality data present particular challenges when learning from these domains. Increasing the sample size is a natural and widely used strategy for alleviating these challenges. Moreover, data-augm...

Bibliographic Details
Main Authors:	Ahlam Alrehili, Areej Alhothali
Format:	Article
Language:	English
Published:	PeerJ Inc. 2025-03-01
Series:	PeerJ Computer Science
Subjects:	Arabic grammatical error correction, GEC, Corpus, ChatGPT, NLP, AraGEC
Online Access:	https://peerj.com/articles/cs-2724.pdf

author	Ahlam Alrehili, Areej Alhothali
collection	DOAJ
description	Natural language processing (NLP) augments text data to overcome sample size constraints. Scarce and low-quality data present particular challenges when learning from these domains. Increasing the sample size is a natural and widely used strategy for alleviating these challenges. Moreover, data-augmentation techniques are commonly used in languages with rich data resources to address problems such as exposure bias. In this study, we chose Arabic to increase the sample size and correct grammatical errors. Arabic is considered a low-resource language for grammatical error correction (GEC), despite being widely used by Arabs and non-Arabs alike because of its close connection to Islam. Therefore, this study aims to develop an Arabic corpus called “Tibyan” for grammatical error correction using ChatGPT. ChatGPT is used as a data-augmentation tool based on pairs of Arabic sentences, called guide sentences, each matching a sentence containing grammatical errors with an error-free sentence extracted from Arabic books. Multiple steps were involved in establishing our corpus, including collecting and pre-processing pairs of Arabic texts from various sources, such as books and open-access corpora. We then used ChatGPT to generate a parallel corpus, using the previously collected text as a guide for generating sentences with multiple types of errors. By engaging linguistic experts to review and validate the automatically generated sentences, we ensured they were correct and error-free. The corpus was validated and refined iteratively based on feedback provided by linguistic experts to improve its accuracy. Finally, we used the Arabic Error Type Annotation tool (ARETA) to analyze the types of errors in the Tibyan corpus. Our corpus contained 49% of errors, including seven types: orthography, morphology, syntax, semantics, punctuation, merge, and split. The Tibyan corpus contains approximately 600K tokens.
format	Article
id	doaj-art-7d0bfeb2c46146ffa2ce81243513627b
institution	DOAJ
issn	2376-5992
language	English
publishDate	2025-03-01
publisher	PeerJ Inc.
record_format	Article
series	PeerJ Computer Science
spelling	doaj-art-7d0bfeb2c46146ffa2ce81243513627b; 2025-08-20T03:04:26Z; eng; PeerJ Inc.; PeerJ Computer Science; 2376-5992; 2025-03-01; 11; e2724; 10.7717/peerj-cs.2724; Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction; Ahlam Alrehili (Department of Computer Sciences, Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia); Areej Alhothali (Department of Computer Sciences, Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia); https://peerj.com/articles/cs-2724.pdf; Arabic grammatical error correction; GEC; Corpus; ChatGPT; NLP; AraGEC
title	Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction
topic	Arabic grammatical error correction, GEC, Corpus, ChatGPT, NLP, AraGEC
url	https://peerj.com/articles/cs-2724.pdf

Similar Items