Tokenization efficiency of current foundational large language models for the Ukrainian language

Foundational large language models (LLMs) are deployed in multilingual environments across a range of general and narrow task domains. These models generate text token by token, which makes them slower and more computationally expensive for low-resource languages that are underrepresented in the tokenizer vocabulary. Underrepresentation also makes the models costlier to use, since pricing usually depends on the number of input and output tokens. This study compares the tokenizers of multiple pretrained LLMs on the Ukrainian language. It reports tokenization fertility measurements for current state-of-the-art (SOTA) models, on both general-purpose and domain-specific text, along with the results of experiments on a transliteration approach that makes tokenization more efficient without information loss. The results offer insight into the current models' disadvantages and the problems they may pose for Ukrainian language modeling.
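
The fertility metric and the transliteration idea described above can be illustrated with a short sketch. This is a minimal illustration, not the authors' code: it assumes the Hugging Face transformers package (and its tokenizer dependencies) is installed, uses "gpt2" and "xlm-roberta-base" only as stand-in checkpoints for the tokenizers compared in the study, and substitutes a toy, lossy character map for the paper's lossless transliteration scheme.

from transformers import AutoTokenizer

# Toy Ukrainian-to-Latin character map, sufficient for the sample below.
# It is deliberately small and lossy; the paper's transliteration is
# lossless and reversible, which this sketch does not attempt.
UK_TO_LAT = {
    "т": "t", "о": "o", "к": "k", "е": "e", "н": "n", "і": "i",
    "з": "z", "а": "a", "ц": "ts", "я": "ia", "м": "m", "в": "v",
    "и": "y", "р": "r", "л": "l", "ю": "iu",
}

def transliterate(text: str) -> str:
    # Map each known Cyrillic character, passing everything else through.
    return "".join(UK_TO_LAT.get(ch, ch) for ch in text.lower())

def fertility(tokenizer, text: str) -> float:
    # Fertility: average number of tokens per whitespace-separated word.
    tokens = tokenizer.encode(text, add_special_tokens=False)
    return len(tokens) / len(text.split())

sample = "Токенізація мов із кирилицею"  # "Tokenization of Cyrillic-script languages"
for name in ("gpt2", "xlm-roberta-base"):  # stand-ins for the compared models
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: Cyrillic {fertility(tok, sample):.2f} tokens/word, "
          f"transliterated {fertility(tok, transliterate(sample)):.2f}")

On an English-centric vocabulary such as GPT-2's byte-level BPE, the Cyrillic sample typically splits into several tokens per word, while the transliterated form falls onto more familiar Latin subwords; that gap is roughly the inefficiency the study quantifies.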

Bibliographic Details
Main Authors: Daniil Maksymenko (Department of Artificial Intelligence, Kharkiv National University of Radio Electronics, Kharkiv, Ukraine); Oleksii Turuta (Computer Science and Artificial Intelligence Institution, V. N. Karazin Kharkiv National University, Kharkiv, Ukraine)
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-08-01
Series: Frontiers in Artificial Intelligence, vol. 8, article 1538165
ISSN: 2624-8212
DOI: 10.3389/frai.2025.1538165
Collection: DOAJ (record doaj-art-c56feee52a9e4d00b20c0767c7a3d48e)
Subjects: tokenization; large language model; corpus; domain; low-resource language
Online Access: https://www.frontiersin.org/articles/10.3389/frai.2025.1538165/full