Research on morphological knowledge-guided low-resource agglutinative languages-Chinese translation

Abstract Data sparsity and out-of-vocabulary words are the main challenges in low-resource machine translation, and word segmentation can reduce their impact on translation. Word segmentation methods fall roughly into two categories: unsupervised word segmentation and morpholo...

Bibliographic Details
Main Authors: Gulinigeer Abudouwaili, Sirajahmat Ruzmamat, Kahaerjiang Abiderexiti, Tuergen Yibulayin, Nian Yi, Aishan Wumaier
Format: Article
Language: English
Published: Springer 2025-02-01
Series: Complex & Intelligent Systems
Subjects: Subword segmentation; Morphological knowledge; Machine translation; Agglutinative Languages
Online Access: https://doi.org/10.1007/s40747-025-01780-5
_version_ 1850252252649881600
author Gulinigeer Abudouwaili
Sirajahmat Ruzmamat
Kahaerjiang Abiderexiti
Tuergen Yibulayin
Nian Yi
Aishan Wumaier
collection DOAJ
description Abstract Data sparsity and out-of-vocabulary words are the main challenges in low-resource machine translation, and word segmentation can reduce their impact on translation. Word segmentation methods fall roughly into two categories: unsupervised word segmentation and morphological knowledge-based word segmentation. However, the two kinds of method perform differently on downstream tasks, and their effectiveness against the out-of-vocabulary problem has not yet been verified. Therefore, we first explore the impact of mainstream subword segmentation methods on machine translation and analyze the advantages and disadvantages of the two categories. Second, we combine the advantages of both to improve existing segmentation methods, proposing two semi-supervised subword segmentation methods that incorporate morphological knowledge, and we analyze their impact on the distributions of vocabulary and out-of-vocabulary words. Third, we introduce a dual encoder to improve the model’s ability to extract information, and we improve existing information fusion methods during feature fusion to avoid information loss. Finally, we conduct validation experiments on translation from low-resource agglutinative languages to Chinese, using Kazakh, Uyghur, and Uzbek as case studies. The proposed subword segmentation methods and the novel machine translation model improve translation quality to varying degrees, especially when less data is available.
format Article
id doaj-art-e432ef47760f459e9e150a7545208a46
institution OA Journals
issn 2199-4536
2198-6053
language English
publishDate 2025-02-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-e432ef47760f459e9e150a7545208a46
2025-08-20T01:57:40Z
eng
Springer
Complex & Intelligent Systems
2199-4536
2198-6053
2025-02-01
113121
10.1007/s40747-025-01780-5
Research on morphological knowledge-guided low-resource agglutinative languages-Chinese translation
Gulinigeer Abudouwaili (0)
Sirajahmat Ruzmamat (1)
Kahaerjiang Abiderexiti (2)
Tuergen Yibulayin (3)
Nian Yi (4)
Aishan Wumaier (5)
School of Computer Science and Technology, Xinjiang University (affiliation of authors 0 to 5)
https://doi.org/10.1007/s40747-025-01780-5
Subword segmentation
Morphological knowledge
Machine translation
Agglutinative Languages
title Research on morphological knowledge-guided low-resource agglutinative languages-Chinese translation
topic Subword segmentation
Morphological knowledge
Machine translation
Agglutinative Languages
url https://doi.org/10.1007/s40747-025-01780-5