Research on morphological knowledge-guided low-resource agglutinative languages-Chinese translation

Bibliographic Details
Main Authors:	Gulinigeer Abudouwaili, Sirajahmat Ruzmamat, Kahaerjiang Abiderexiti, Tuergen Yibulayin, Nian Yi, Aishan Wumaier
Format:	Article
Language:	English
Published:	Springer 2025-02-01
Series:	Complex & Intelligent Systems
Subjects:	Subword segmentation; Morphological knowledge; Machine translation; Agglutinative Languages
Online Access:	https://doi.org/10.1007/s40747-025-01780-5

Description
Summary:	Data sparsity and out-of-vocabulary words are the main challenges in low-resource machine translation, and word segmentation can reduce their impact. Word segmentation methods fall roughly into two categories: unsupervised segmentation and segmentation based on morphological knowledge. However, the two categories perform differently on downstream tasks, and their effectiveness against the out-of-vocabulary problem has not yet been verified. We therefore first explore the impact of mainstream subword segmentation methods on machine translation and analyze the advantages and disadvantages of the two categories. Second, we combine the advantages of both to improve existing segmentation methods, proposing two semi-supervised subword segmentation methods that incorporate morphological knowledge, and we analyze their impact on the distributions of vocabulary and out-of-vocabulary words. Third, we introduce a dual encoder to improve the model’s ability to extract information, and we improve existing information fusion methods during feature fusion to avoid information loss. Finally, we conduct validation experiments on machine translation from low-resource agglutinative languages into Chinese, using Kazakh, Uyghur, and Uzbek as case studies. The proposed subword segmentation approach and novel machine translation model improve translation quality to varying degrees, especially when less data is available.
ISSN:	2199-4536 2198-6053

Similar Items