Research on morphological knowledge-guided low-resource agglutinative languages-Chinese translation
Abstract: Data sparsity and out-of-vocabulary words are the main challenges in low-resource machine translation, and word segmentation can reduce their impact. Word segmentation can be roughly divided into two categories: unsupervised word segmentation and morpholo...
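| Main Authors: | Gulinigeer Abudouwaili, Sirajahmat Ruzmamat, Kahaerjiang Abiderexiti, Tuergen Yibulayin, Nian Yi, Aishan Wumaier |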
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-02-01 |
| Series: | Complex & Intelligent Systems |
| Subjects: | Subword segmentation; Morphological knowledge; Machine translation; Agglutinative Languages |
| Online Access: | https://doi.org/10.1007/s40747-025-01780-5 |
| _version_ | 1850252252649881600 |
|---|---|
| author | Gulinigeer Abudouwaili; Sirajahmat Ruzmamat; Kahaerjiang Abiderexiti; Tuergen Yibulayin; Nian Yi; Aishan Wumaier |
| author_sort | Gulinigeer Abudouwaili |
| collection | DOAJ |
| description | Abstract: Data sparsity and out-of-vocabulary words are the main challenges in low-resource machine translation, and word segmentation can reduce their impact. Word segmentation methods fall roughly into two categories: unsupervised word segmentation and morphological knowledge-based word segmentation. However, the two methods perform differently on downstream tasks, and their effectiveness on the out-of-vocabulary problem has not yet been verified. Therefore, we first explore the impact of mainstream subword segmentation methods on machine translation and analyze the advantages and disadvantages of the two methods. Second, we combine the advantages of the two methods to improve existing segmentation methods, proposing two semi-supervised subword segmentation methods that incorporate morphological knowledge, and we analyze their impact on the distributions of vocabulary and out-of-vocabulary words. Third, we introduce a dual encoder to improve the model’s ability to extract information, and we improve existing information fusion methods during feature fusion to avoid information loss. Finally, we conduct validation experiments on low-resource agglutinative language-to-Chinese machine translation, using Kazakh, Uyghur, and Uzbek as case studies. The proposed subword segmentation approach and novel machine translation model improve translation quality to varying degrees, especially when less data is available. |
| format | Article |
| id | doaj-art-e432ef47760f459e9e150a7545208a46 |
| institution | OA Journals |
| issn | 2199-4536 2198-6053 |
| language | English |
| publishDate | 2025-02-01 |
| publisher | Springer |
| record_format | Article |
| series | Complex & Intelligent Systems |
| spelling | doaj-art-e432ef47760f459e9e150a7545208a46; 2025-08-20T01:57:40Z; eng; Springer; Complex & Intelligent Systems; 2199-4536; 2198-6053; 2025-02-01; 113121; 10.1007/s40747-025-01780-5; Research on morphological knowledge-guided low-resource agglutinative languages-Chinese translation; Gulinigeer Abudouwaili (0); Sirajahmat Ruzmamat (1); Kahaerjiang Abiderexiti (2); Tuergen Yibulayin (3); Nian Yi (4); Aishan Wumaier (5); School of Computer Science and Technology, Xinjiang University; School of Computer Science and Technology, Xinjiang University; School of Computer Science and Technology, Xinjiang University; School of Computer Science and Technology, Xinjiang University; School of Computer Science and Technology, Xinjiang University; School of Computer Science and Technology, Xinjiang University; Abstract Data sparsity and out-of-vocabulary are the main challenges in low-resource machine translation, and the impact of such problems in translation can be reduced through word segmentation. Word segmentation can be roughly divided into two categories: unsupervised word segmentation and morphological knowledge-based word segmentation. However, the performance of the two methods when applied to downstream tasks varies, and the effectiveness of this method for out-of-vocabulary problem has not yet been verified. Therefore, we first explore the impact of mainstream subword segmentation methods on machine translation and analyze the advantages and disadvantages of the two methods. Secondly, we utilized the advantages of the two methods to improve the existing segmentation methods. Two semi-supervised subword segmentation methods combining morphological knowledge were proposed. The impact of these methods on the distributions of vocabulary and out-of-vocabulary words was analyzed. Once again, a dual encoder was introduced in the encoding to improve the model’s ability to extract information. We improved the existing information fusion methods during feature fusion to avoid information loss. Finally, validation experiments were conducted on low-resource agglutinative languages to Chinese machine translation using Kazakh, Uyghur, and Uzbek as case studies. In this work, we proposed a new subword segmentation method and a novelty machine translation model to improve translation quality to varying degrees, especially on lower data resources.; https://doi.org/10.1007/s40747-025-01780-5; Subword segmentation; Morphological knowledge; Machine translation; Agglutinative Languages |
| title | Research on morphological knowledge-guided low-resource agglutinative languages-Chinese translation |
| topic | Subword segmentation; Morphological knowledge; Machine translation; Agglutinative Languages |
| url | https://doi.org/10.1007/s40747-025-01780-5 |