Tokenization efficiency of current foundational large language models for the Ukrainian language
Foundational large language models (LLMs) are deployed in multilingual environments across a range of general and narrow task domains. These models generate text token by token, making them slower and more computationally expensive for low-resource languages that are underrepresented in the tokenizer vocabulary. It also makes their usage more costly in such cases, as pricing usually depends on the number of input and output tokens. This study compares multiple tokenizers of pretrained LLMs for the Ukrainian language. It also provides tokenization fertility measurements for current state-of-the-art (SOTA) models, both in terms of general-purpose language and specific domains, as well as results of experiments with a transliteration approach to make tokenization more efficient without information loss. The results provide insights into the current models’ disadvantages and possible problems in terms of Ukrainian language modeling.
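The abstract's core measurement, tokenization fertility (subword tokens per word), and its transliteration experiment are straightforward to illustrate. The sketch below is a minimal reconstruction, not the authors' code: the model names, the sample sentence, and the simplified (non-reversible) transliteration table are all assumptions made for illustration.

```python
# Minimal sketch: compare tokenization fertility (tokens per whitespace word)
# for Ukrainian text in Cyrillic vs. a naive Latin transliteration.
# Assumptions: the model list, sample text, and transliteration table are
# illustrative; the paper's lossless transliteration scheme is not reproduced.
from transformers import AutoTokenizer

MODELS = ["gpt2", "xlm-roberta-base"]  # illustrative, not the paper's list

SAMPLE_UK = "Великі мовні моделі генерують текст токен за токеном."

# Simplified Ukrainian-to-Latin table (lowercase only, case not preserved).
# A truly lossless scheme, as the abstract requires, would need a fully
# invertible mapping; this one is not.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "h", "ґ": "g", "д": "d", "е": "e",
    "є": "ie", "ж": "zh", "з": "z", "и": "y", "і": "i", "ї": "yi", "й": "i",
    "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts", "ч": "ch",
    "ш": "sh", "щ": "shch", "ю": "iu", "я": "ia", "ь": "",
}

def transliterate(text: str) -> str:
    """Map Cyrillic characters to Latin; pass everything else through."""
    return "".join(TRANSLIT.get(ch.lower(), ch) for ch in text)

def fertility(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    return len(tokenizer.tokenize(text)) / len(text.split())

for name in MODELS:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: cyrillic={fertility(tok, SAMPLE_UK):.2f} "
          f"latin={fertility(tok, transliterate(SAMPLE_UK)):.2f}")
```

On vocabularies trained mostly on Latin-script text, the transliterated input typically yields fewer tokens per word, which is the efficiency gain the abstract refers to: fewer tokens means lower per-token API cost and faster generation.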
| Main Authors: | Daniil Maksymenko; Oleksii Turuta |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A., 2025-08-01 |
| Series: | Frontiers in Artificial Intelligence |
| Subjects: | tokenization; large language model; corpus; domain; low-resource language |
| Online Access: | https://www.frontiersin.org/articles/10.3389/frai.2025.1538165/full |
| author | Daniil Maksymenko; Oleksii Turuta |
|---|---|
| collection | DOAJ |
| description | Foundational large language models (LLMs) are deployed in multilingual environments across a range of general and narrow task domains. These models generate text token by token, making them slower and more computationally expensive for low-resource languages that are underrepresented in the tokenizer vocabulary. It also makes their usage more costly in such cases, as pricing usually depends on the number of input and output tokens. This study compares multiple tokenizers of pretrained LLMs for the Ukrainian language. It also provides tokenization fertility measurements for current state-of-the-art (SOTA) models, both in terms of general-purpose language and specific domains, as well as results of experiments with a transliteration approach to make tokenization more efficient without information loss. The results provide insights into the current models’ disadvantages and possible problems in terms of Ukrainian language modeling. |
| format | Article |
| id | doaj-art-c56feee52a9e4d00b20c0767c7a3d48e |
| institution | DOAJ |
| issn | 2624-8212 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Frontiers Media S.A. |
| record_format | Article |
| series | Frontiers in Artificial Intelligence |
| spelling | Frontiers in Artificial Intelligence, vol. 8, article 1538165 (2025-08-01); DOI: 10.3389/frai.2025.1538165. Daniil Maksymenko (Department of Artificial Intelligence, Kharkiv National University of Radio Electronics, Kharkiv, Ukraine); Oleksii Turuta (Computer Science and Artificial Intelligence Institution, V. N. Karazin Kharkiv National University, Kharkiv, Ukraine). |
| title | Tokenization efficiency of current foundational large language models for the Ukrainian language |
| topic | tokenization; large language model; corpus; domain; low-resource language |
| url | https://www.frontiersin.org/articles/10.3389/frai.2025.1538165/full |