Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian
Nowadays, grammatical error correction (GEC) has a significant role in writing since even native speakers often face challenges with proficient writing. This research is focused on developing a methodology to correct grammatical errors in the Romanian language, a less-resourced language for which th...
| Main Authors: | Mihai-Cristian Tudose, Stefan Ruseti, Mihai Dascalu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Information |
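| Subjects: | grammatical error correction; two-phased approach with detection and correction; end-to-end correction; Romanian resources |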
| Online Access: | https://www.mdpi.com/2078-2489/16/3/242 |
| _version_ | 1849341566795120640 |
|---|---|
| author | Mihai-Cristian Tudose, Stefan Ruseti, Mihai Dascalu |
| author_facet | Mihai-Cristian Tudose, Stefan Ruseti, Mihai Dascalu |
| author_sort | Mihai-Cristian Tudose |
| collection | DOAJ |
| description | Grammatical error correction (GEC) plays a significant role in writing, since even native speakers often struggle to write proficiently. This research develops a methodology for correcting grammatical errors in Romanian, a less-resourced language for which there are currently no up-to-date GEC solutions. Our main contributions include an open-source synthetic dataset of 345,403 Romanian sentences, a manually curated dataset of 3054 social media comments, a two-phased GEC approach, and a comparison with several Romanian models, including RoMistral and RoLama3, as well as LanguageTool, GPT-4o mini, and GPT-4o. We use the synthetic dataset to fine-tune our models and rely on two real-life datasets with genuine human mistakes (i.e., CNA and RoComments) to evaluate performance. Building an artificial dataset was necessary because real-life mistake datasets are scarce, whereas introducing RoComments, a new genuine dataset, is motivated by the need to cover the errors that native speakers make in social media comments. In the two-phased approach, we first identify the location of erroneous tokens in the sentence; next, an encoder–decoder model replaces the erroneous tokens. Our approach achieved an $F_{0.5}$ of 0.57 on CNA and 0.64 on RoComments, surpassing LanguageTool by a considerable margin, as well as an end-to-end version based on Flan-T5 and mT0 in most setups. While our two-phased method did not outperform GPT-4o, arguably because of its smaller size and more limited language exposure, it obtained results on par with GPT-4o mini and achieved higher performance than all Romanian LLMs. |
| format | Article |
| id | doaj-art-ffd3270b208c4ceb9c899ee549400fa4 |
| institution | Kabale University |
| issn | 2078-2489 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Information |
| spelling | doaj-art-ffd3270b208c4ceb9c899ee549400fa4; 2025-08-20T03:43:36Z; eng; MDPI AG; Information; ISSN 2078-2489; 2025-03-01; vol. 16, no. 3, art. 242; DOI 10.3390/info16030242; Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian; Mihai-Cristian Tudose, Stefan Ruseti, Mihai Dascalu (all: Faculty of Automatic Control and Computers, National University of Science and Technology POLITEHNICA Bucharest, 313 Splaiul Independentei, 060042 Bucharest, Romania); https://www.mdpi.com/2078-2489/16/3/242; grammatical error correction; two-phased approach with detection and correction; end-to-end correction; Romanian resources |
| spellingShingle | Mihai-Cristian Tudose Stefan Ruseti Mihai Dascalu Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian Information grammatical error correction two-phased approach with detection and correction end-to-end correction Romanian resources |
| title | Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian |
| title_full | Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian |
| title_fullStr | Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian |
| title_full_unstemmed | Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian |
| title_short | Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian |
| title_sort | show me all writing errors a two phased grammatical error corrector for romanian |
| topic | grammatical error correction; two-phased approach with detection and correction; end-to-end correction; Romanian resources |
| url | https://www.mdpi.com/2078-2489/16/3/242 |
| work_keys_str_mv | AT mihaicristiantudose showmeallwritingerrorsatwophasedgrammaticalerrorcorrectorforromanian AT stefanruseti showmeallwritingerrorsatwophasedgrammaticalerrorcorrectorforromanian AT mihaidascalu showmeallwritingerrorsatwophasedgrammaticalerrorcorrectorforromanian |
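
The description above outlines a two-phased flow (first locate erroneous tokens, then replace only those tokens) and reports results as $F_{0.5}$. The following is a minimal sketch of that flow, not the authors' implementation: the function names, the `TOY_FIXES` lookup table, and the example sentence are all hypothetical, and the lookup table merely stands in for the token-level detector and the encoder–decoder corrector. The score uses the standard $F_\beta$ formula over sets of proposed and gold edits; the record does not state which scorer the article actually used.

```python
from typing import Callable, List, Set, Tuple

Edit = Tuple[int, str]  # (token position, replacement token)


def detect(tokens: List[str], is_error: Callable[[str], bool]) -> List[int]:
    """Phase 1: return the positions of tokens flagged as erroneous."""
    return [i for i, tok in enumerate(tokens) if is_error(tok)]


def correct(tokens: List[str], positions: List[int],
            rewrite: Callable[[str], str]) -> List[Edit]:
    """Phase 2: propose a replacement only for the flagged positions."""
    return [(i, rewrite(tokens[i])) for i in positions]


def f_beta(proposed: Set[Edit], gold: Set[Edit], beta: float = 0.5) -> float:
    """Standard F-beta over edit sets; beta < 1 weights precision above recall."""
    tp = len(proposed & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(proposed)
    recall = tp / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical stand-in for both models: a tiny table of known fixes.
TOY_FIXES = {"sa": "să", "intradevar": "într-adevăr"}

tokens = "vreau sa merg acasa".split()
positions = detect(tokens, lambda t: t in TOY_FIXES)
edits = set(correct(tokens, positions, lambda t: TOY_FIXES[t]))

# Assumed gold annotation for the toy sentence: two missing diacritics.
gold = {(1, "să"), (3, "acasă")}
score = f_beta(edits, gold)
print(positions, sorted(edits), round(score, 3))
```

In this toy run the detector flags one of the two gold errors, so precision is 1.0 and recall is 0.5; because $\beta = 0.5$ favors precision, the resulting score sits closer to the precision value than a plain $F_1$ would.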