Research on morphological knowledge-guided low-resource agglutinative languages-Chinese translation
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer 2025-02-01 |
| Series: | Complex & Intelligent Systems |
| Subjects: | |
| Online Access: | https://doi.org/10.1007/s40747-025-01780-5 |
| Summary: | Data sparsity and out-of-vocabulary (OOV) words are the main challenges in low-resource machine translation, and word segmentation can reduce their impact. Word segmentation methods fall roughly into two categories: unsupervised segmentation and morphological knowledge-based segmentation. However, the two categories perform differently on downstream tasks, and their effectiveness against the out-of-vocabulary problem has not yet been verified. Therefore, we first explore the impact of mainstream subword segmentation methods on machine translation and analyze the advantages and disadvantages of the two categories. Second, we combine the advantages of both to improve existing segmentation methods, proposing two semi-supervised subword segmentation methods that incorporate morphological knowledge, and we analyze their impact on the distributions of vocabulary and out-of-vocabulary words. Third, we introduce a dual encoder to improve the model's ability to extract information, and we improve existing information fusion methods during feature fusion to avoid information loss. Finally, validation experiments were conducted on low-resource agglutinative language-to-Chinese machine translation, using Kazakh, Uyghur, and Uzbek as case studies. In this work, we propose a new subword segmentation approach and a novel machine translation model that improve translation quality to varying degrees, especially when less training data is available. |
|---|---|
| ISSN: | 2199-4536 2198-6053 |
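
The summary's claim that word segmentation can reduce the out-of-vocabulary problem can be illustrated with a minimal sketch. This is not the article's method: the greedy suffix splitter, the `@@` suffix marker, the function names, and the four romanized Uzbek-style toy words are all assumptions chosen only to show the effect on an OOV rate.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = [t for t in test_tokens if t not in vocab]
    return len(unseen) / len(test_tokens)


def segment(word, suffixes):
    """Toy splitter (hypothetical): greedily peel known suffixes off the word end."""
    pieces = []
    changed = True
    while changed:
        changed = False
        for suf in suffixes:
            # Keep at least one character as the stem.
            if word.endswith(suf) and len(word) > len(suf):
                pieces.insert(0, "@@" + suf)
                word = word[: -len(suf)]
                changed = True
                break
    return [word] + pieces


# Toy data (assumed, not from the article): stem + plural "-lar" / locative "-da".
SUFFIXES = ["lar", "da"]
train_words = ["kitoblar", "maktabda"]
test_words = ["kitobda", "maktablar"]

# Word-level vocabulary: every test word is an unseen surface form.
word_oov = oov_rate(train_words, test_words)

# Subword-level vocabulary: stems and suffixes recombine, so nothing is unseen.
train_sub = [p for w in train_words for p in segment(w, SUFFIXES)]
test_sub = [p for w in test_words for p in segment(w, SUFFIXES)]
subword_oov = oov_rate(train_sub, test_sub)

print(word_oov, subword_oov)
```

In this toy setting both test words are unseen as whole words, while every stem and suffix has already appeared in training once the words are split. Real agglutinative corpora are far less tidy, which is the gap the compared segmentation methods address.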