Automatical sampling with heterogeneous corpora for grammatical error correction

Abstract: Thanks to the strong representation capability of pre-trained language models, supervised grammatical error correction has achieved promising performance. However, traditional model training depends heavily on large numbers of similarly distributed samples. The model performance d...

Bibliographic Details
Main Authors:	Shichang Zhu, Jianjian Liu, Ying Li, Zhengtao Yu
Format:	Article
Language:	English
Published:	Springer 2024-11-01
Series:	Complex & Intelligent Systems
Subjects:	Grammatical error correction; Automatical sampling; Corpus weighting; Heterogeneous model ensemble; Pre-trained language models
Online Access:	https://doi.org/10.1007/s40747-024-01653-3

author	Shichang Zhu, Jianjian Liu, Ying Li, Zhengtao Yu
collection	DOAJ
description	Thanks to the strong representation capability of pre-trained language models, supervised grammatical error correction has achieved promising performance. However, traditional model training depends heavily on large numbers of similarly distributed samples, and model performance drops sharply once the distributions of the training and test data diverge. To address this issue, we propose an automatic sampling approach that selects high-quality samples from different corpora and filters out irrelevant or harmful ones. Concretely, we first provide a detailed analysis of the error-type and sentence-length distributions of all datasets. Second, our corpus weighting approach automatically assigns a weight to each sample based on the analysis results, emphasizing beneficial samples and ignoring noisy ones. Finally, we enhance typical Seq2Seq and Seq2Edit grammatical error correction models with pre-trained language models and design a model ensemble algorithm that integrates the advantages of heterogeneous models and weighted samples. Experiments on benchmark datasets demonstrate that the proper use of different corpora considerably improves the accuracy of grammatical error correction. A detailed analysis offers further insight into the effect of different corpus weighting strategies.
format	Article
id	doaj-art-c317d7bcdd2a47529935c35ada7e8d56
institution	Kabale University
issn	2199-4536 2198-6053
language	English
publishDate	2024-11-01
publisher	Springer
record_format	Article
series	Complex & Intelligent Systems
spelling	doaj-art-c317d7bcdd2a47529935c35ada7e8d56 | 2025-02-02T12:49:04Z | eng | Springer | Complex & Intelligent Systems | 2199-4536 | 2198-6053 | 2024-11-01 | 11111 | 10.1007/s40747-024-01653-3 | Automatical sampling with heterogeneous corpora for grammatical error correction | Shichang Zhu, Jianjian Liu, Ying Li, Zhengtao Yu (all four: Faculty of Information Engineering and Automation, Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology) | https://doi.org/10.1007/s40747-024-01653-3 | Grammatical error correction; Automatical sampling; Corpus weighting; Heterogeneous model ensemble; Pre-trained language models
title	Automatical sampling with heterogeneous corpora for grammatical error correction
topic	Grammatical error correction; Automatical sampling; Corpus weighting; Heterogeneous model ensemble; Pre-trained language models
url	https://doi.org/10.1007/s40747-024-01653-3

Similar Items