Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction
In natural language processing (NLP), text data are often augmented to overcome sample-size constraints. Scarce and low-quality data make learning particularly challenging. Increasing the sample size is a natural and widely used strategy for alleviating these challenges. Moreover, data-augm...
| Main Authors: | Ahlam Alrehili, Areej Alhothali |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | PeerJ Inc. 2025-03-01 |
| Series: | PeerJ Computer Science |
| Subjects: | Arabic grammatical error correction; GEC; Corpus; ChatGPT; NLP; AraGEC |
| Online Access: | https://peerj.com/articles/cs-2724.pdf |
| _version_ | 1849766898422513664 |
|---|---|
| author | Ahlam Alrehili; Areej Alhothali |
| author_facet | Ahlam Alrehili; Areej Alhothali |
| author_sort | Ahlam Alrehili |
| collection | DOAJ |
| description | In natural language processing (NLP), text data are often augmented to overcome sample-size constraints. Scarce and low-quality data make learning particularly challenging. Increasing the sample size is a natural and widely used strategy for alleviating these challenges. Moreover, data-augmentation techniques are commonly used in languages with rich data resources to address problems such as exposure bias. In this study, we chose Arabic as the target language for increasing the sample size for grammatical error correction (GEC). Arabic is considered a low-resource language for GEC, despite its popularity among Arabs and non-Arabs alike, which stems from its close connection to Islam. Therefore, this study aims to develop an Arabic corpus called “Tibyan” for grammatical error correction using ChatGPT. ChatGPT is used as a data-augmentation tool, guided by pairs of Arabic sentences, called guide sentences, in which a sentence containing grammatical errors is matched with an error-free sentence extracted from Arabic books. Building the corpus involved multiple steps, including collecting and pre-processing pairs of Arabic texts from various sources, such as books and open-access corpora. We then used ChatGPT to generate a parallel corpus, using the previously collected text as a guide for generating sentences with multiple types of errors. By engaging linguistic experts to review and validate the automatically generated sentences, we ensured they were correct and error-free. The corpus was validated and refined iteratively based on feedback provided by linguistic experts to improve its accuracy. Finally, we used the Arabic Error Type Annotation tool (ARETA) to analyze the types of errors in the Tibyan corpus. The corpus contained 49% errors across seven types: orthography, morphology, syntax, semantics, punctuation, merge, and split. The Tibyan corpus contains approximately 600K tokens. |
| format | Article |
| id | doaj-art-7d0bfeb2c46146ffa2ce81243513627b |
| institution | DOAJ |
| issn | 2376-5992 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | PeerJ Inc. |
| record_format | Article |
| series | PeerJ Computer Science |
| spelling | doaj-art-7d0bfeb2c46146ffa2ce81243513627b; 2025-08-20T03:04:26Z; eng; PeerJ Inc.; PeerJ Computer Science; 2376-5992; 2025-03-01; 11; e2724; 10.7717/peerj-cs.2724; Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction; Ahlam Alrehili (Department of Computer Sciences, Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia); Areej Alhothali (Department of Computer Sciences, Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia); https://peerj.com/articles/cs-2724.pdf; Arabic grammatical error correction; GEC; Corpus; ChatGPT; NLP; AraGEC |
| spellingShingle | Ahlam Alrehili; Areej Alhothali; Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction; PeerJ Computer Science; Arabic grammatical error correction; GEC; Corpus; ChatGPT; NLP; AraGEC |
| title | Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction |
| title_full | Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction |
| title_fullStr | Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction |
| title_full_unstemmed | Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction |
| title_short | Tibyan corpus: balanced and comprehensive error coverage corpus using ChatGPT for Arabic grammatical error correction |
| title_sort | tibyan corpus balanced and comprehensive error coverage corpus using chatgpt for arabic grammatical error correction |
| topic | Arabic grammatical error correction; GEC; Corpus; ChatGPT; NLP; AraGEC |
| url | https://peerj.com/articles/cs-2724.pdf |
| work_keys_str_mv | AT ahlamalrehili tibyancorpusbalancedandcomprehensiveerrorcoveragecorpususingchatgptforarabicgrammaticalerrorcorrection AT areejalhothali tibyancorpusbalancedandcomprehensiveerrorcoveragecorpususingchatgptforarabicgrammaticalerrorcorrection |