Supplementing CEFR-graded vocabulary lists for language learners by leveraging information on dictionary views, corpus frequency, part-of-speech, and polysemy

Abstract: The study explores an approach to supplementing existing CEFR-graded vocabulary lists, which are often incomplete, by imputing CEFR levels for additional vocabulary items. This is achieved by analysing word-level data such as dictionary views, corpus frequency, part-of-speech, and polysemy....

Bibliographic Details
Main Authors:	Sascha Wolfer, Robert Lew
Format:	Article
Language:	English
Published:	Springer Nature 2025-07-01
Series:	Humanities & Social Sciences Communications
Online Access:	https://doi.org/10.1057/s41599-025-05446-y

author	Sascha Wolfer Robert Lew
collection	DOAJ
description	Abstract: The study explores an approach to supplementing existing CEFR-graded vocabulary lists, which are often incomplete, by imputing CEFR levels for additional vocabulary items. This is achieved by analysing word-level data such as dictionary views, corpus frequency, part-of-speech, and polysemy. Using English as a test case, the study employs a variety of machine-learning models to predict CEFR levels for words not included in the initial set. The models significantly outperform a random baseline, indicating their effectiveness. The findings suggest that corpus frequency is the most influential predictor, followed by dictionary views and polysemy. The study reveals the potential of this semi-automatic approach to expand CEFR-graded word lists, making them more comprehensive and accessible for language learners. At the same time, human oversight is recommended to ensure the appropriateness of the imputed words for language learners, such as regarding the inclusion of potentially offensive terms. Future research may extend this methodology to other languages, provided that sufficient linguistic data is available.
format	Article
id	doaj-art-66b647245f4e41d4a0c53934bb761cf4
institution	DOAJ
issn	2662-9992
language	English
publishDate	2025-07-01
publisher	Springer Nature
record_format	Article
series	Humanities & Social Sciences Communications
spelling	doaj-art-66b647245f4e41d4a0c53934bb761cf4 | 2025-08-20T03:04:31Z | eng | Springer Nature | Humanities & Social Sciences Communications | 2662-9992 | 2025-07-01 | 12 | 1 | 1 | 11 | 10.1057/s41599-025-05446-y | Sascha Wolfer (0), Robert Lew (1) | Leibniz Institute for the German Language (IDS) | Adam Mickiewicz University | https://doi.org/10.1057/s41599-025-05446-y
title	Supplementing CEFR-graded vocabulary lists for language learners by leveraging information on dictionary views, corpus frequency, part-of-speech, and polysemy
url	https://doi.org/10.1057/s41599-025-05446-y

Similar Items