Tokenization efficiency of current foundational large language models for the Ukrainian language
Foundational large language models (LLMs) are deployed in multilingual environments across a range of general and narrow task domains. These models generate text token by token, which makes them slower and more computationally expensive for low-resource languages that are underrepresented in the tokenizer…
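The per-token cost the abstract describes can be illustrated at the byte level: many LLM tokenizers fall back to UTF-8 bytes for scripts that are underrepresented in their vocabulary, and Cyrillic characters encode as two bytes each, so a byte-level fallback roughly doubles sequence length relative to ASCII text. A minimal sketch (the sentences and the bytes-per-character proxy are illustrative assumptions, not the paper's methodology):

```python
# Illustrative only: compare UTF-8 bytes per character for an English
# sentence and a Ukrainian one. A tokenizer falling back to byte-level
# encoding would emit roughly this many tokens per character.
en = "Language models generate text token by token."
uk = "Мовні моделі генерують текст токен за токеном."  # illustrative translation

def bytes_per_char(s: str) -> float:
    """UTF-8 bytes per character: 1.0 for ASCII, about 2.0 for Cyrillic."""
    return len(s.encode("utf-8")) / len(s)

print(f"English:   {bytes_per_char(en):.2f} bytes/char")
print(f"Ukrainian: {bytes_per_char(uk):.2f} bytes/char")
```

The gap widens further for subword tokenizers trained mostly on English, since Ukrainian words are also split into more, shorter subword pieces.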
| Main Authors: | Daniil Maksymenko, Oleksii Turuta |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A., 2025-08-01 |
| Series: | Frontiers in Artificial Intelligence |
| Online Access: | https://www.frontiersin.org/articles/10.3389/frai.2025.1538165/full |
Similar Items
- An Analysis of the Training Data Impact for Domain-Adapted Tokenizer Performances—The Case of Serbian Legal Domain Adaptation
  by: Miloš Bogdanović, et al.
  Published: (2025-07-01)
- Mixtec–Spanish Parallel Text Dataset for Language Technology Development
  by: Hermilo Santiago-Benito, et al.
  Published: (2025-06-01)
- Tokenization and deep learning architectures in genomics: A comprehensive review
  by: Conrad Testagrose, et al.
  Published: (2025-01-01)
- A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language (Mendeley Data)
  by: Davlatyor Mengliev, et al.
  Published: (2025-02-01)
- Toward Low-Resource Languages Machine Translation: A Language-Specific Fine-Tuning With LoRA for Specialized Large Language Models
  by: Xiao Liang, et al.
  Published: (2025-01-01)