Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian

Nowadays, grammatical error correction (GEC) has a significant role in writing since even native speakers often face challenges with proficient writing. This research is focused on developing a methodology to correct grammatical errors in the Romanian language, a less-resourced language for which th...

Bibliographic Details
Main Authors: Mihai-Cristian Tudose, Stefan Ruseti, Mihai Dascalu
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Information
Online Access: https://www.mdpi.com/2078-2489/16/3/242
collection DOAJ
description Grammatical error correction (GEC) plays a significant role in writing, since even native speakers often struggle to write proficiently. This research develops a methodology for correcting grammatical errors in Romanian, a less-resourced language for which there are currently no up-to-date GEC solutions. Our main contributions include an open-source synthetic dataset of 345,403 Romanian sentences, a manually curated dataset of 3054 social media comments, a two-phased GEC approach, and a comparison with several Romanian models, including RoMistral and RoLlama3, as well as LanguageTool, GPT-4o mini, and GPT-4o. We use the synthetic dataset to fine-tune our models and rely on two real-life datasets with genuine human mistakes (i.e., CNA and RoComments) to evaluate performance. Building an artificial dataset was necessary because real-life mistake datasets are scarce, whereas introducing RoComments, a new genuine dataset, is motivated by the need to cover errors made by native speakers in social media comments. In the two-phased approach, we first identify the location of erroneous tokens in the sentence; an encoder–decoder model then replaces those tokens. Our approach achieved an $F_{0.5}$ score of 0.57 on CNA and 0.64 on RoComments, surpassing by a considerable margin LanguageTool as well as an end-to-end version based on Flan-T5 and mT0 in most setups. While our two-phased method did not outperform GPT-4o, arguably because of its smaller size and more limited language exposure, it obtained results on par with GPT-4o mini and achieved higher performance than all Romanian LLMs.
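For readers unfamiliar with the metric reported in the description, the following is a minimal sketch of how an F0.5 score is computed from counts of correct, spurious, and missed corrections. The function name `f_beta` and the example counts are illustrative assumptions, not taken from the article, and the article's own evaluation tooling may differ. With beta = 0.5, precision is weighted more heavily than recall, which is the usual choice in GEC evaluation.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Return the F-beta score from true positives, false positives, false negatives.

    beta < 1 weights precision above recall. Counts here are hypothetical
    edit-level counts, not figures from the article.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical system: 6 correct edits, 2 spurious edits, 6 missed errors.
    print(round(f_beta(6, 2, 6), 3))
```

With these hypothetical counts, precision is 0.75 and recall is 0.5, so the F0.5 score lies closer to the precision than the balanced F1 would.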
id doaj-art-ffd3270b208c4ceb9c899ee549400fa4
institution Kabale University
issn 2078-2489
spelling doaj-art-ffd3270b208c4ceb9c899ee549400fa4
Record updated: 2025-08-20T03:43:36Z
Volume 16, Issue 3, Article 242
DOI: 10.3390/info16030242
Author affiliation (all three authors): Faculty of Automatic Control and Computers, National University of Science and Technology POLITEHNICA Bucharest, 313 Splaiul Independentei, 060042 Bucharest, Romania
topic grammatical error correction
two-phased approach with detection and correction
end-to-end correction
Romanian resources