Research on the Uyghur morphological segmentation model with an attention mechanism

Morphological segmentation is a basic task in information processing for agglutinative languages: it divides words into morphemes, the smallest meaning-bearing units. There are two types of morphological segmentation: canonical segmentation and surface segmentation. As a typical agglutinative language, Uyghur usual...

Bibliographic Details
Main Authors: Gulinigeer Abudouwaili, Kahaerjing Abiderexiti, Yunfei Shen, Aishan Wumaier
Format: Article
Language: English
Published: Taylor & Francis Group 2022-12-01
Series: Connection Science
Subjects: attention mechanism; label imbalance; morphological segmentation; tagging scheme
Online Access: http://dx.doi.org/10.1080/09540091.2022.2134843
author Gulinigeer Abudouwaili
Kahaerjing Abiderexiti
Yunfei Shen
Aishan Wumaier
collection DOAJ
description Morphological segmentation is a basic task in information processing for agglutinative languages: it divides words into morphemes, the smallest meaning-bearing units. There are two types of morphological segmentation: canonical segmentation and surface segmentation. For Uyghur, a typical agglutinative language, canonical segmentation usually relies on statistical methods, which depend on manual feature extraction. In surface segmentation, neural networks avoid the manual feature extraction process. However, to date, no model can provide both kinds of segmentation for Uyghur without added features. In addition, morphological segmentation is usually treated as a sequence labelling task, so label imbalance easily occurs in datasets. To address these issues, this paper proposes an improved labelling scheme that combines morphological boundary labels and voice harmony labels to cover both kinds of segmentation simultaneously. A convolutional network and an attention mechanism are then added to capture local and global features, respectively. Finally, morphological segmentation is treated as a sequence labelling task over character sequences. Because of label proportion imbalance and noise in the dataset, a focal loss function with label smoothing is used. The experimental results show that the model achieves the best F1 values on both canonical segmentation and surface segmentation.
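The description names a focal loss with label smoothing but does not give its formula. The sketch below shows one common way to combine the two for a single character position; it is an illustration under stated assumptions, not the authors' implementation. The function names and the parameters `gamma` (focusing strength) and `epsilon` (smoothing mass) are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw label scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_targets(gold, num_labels, epsilon):
    """Label smoothing: spread epsilon uniformly over all labels and
    give the remaining 1 - epsilon to the gold label."""
    base = epsilon / num_labels
    return [base + (1.0 - epsilon if k == gold else 0.0)
            for k in range(num_labels)]


def focal_loss_with_smoothing(logits, gold, gamma=2.0, epsilon=0.1):
    """Assumed formulation: -sum_k q_k * (1 - p_k)**gamma * log(p_k),
    where q is the smoothed target and p the predicted distribution.
    The (1 - p_k)**gamma factor down-weights labels the model already
    predicts confidently, which eases the effect of label imbalance."""
    probs = softmax(logits)
    targets = smoothed_targets(gold, len(logits), epsilon)
    return -sum(q * (1.0 - p) ** gamma * math.log(max(p, 1e-12))
                for q, p in zip(targets, probs))


if __name__ == "__main__":
    # One character position with three hypothetical labels.
    print(focal_loss_with_smoothing([2.0, 0.0, 0.0], gold=0))
```

With `gamma=0` and `epsilon=0` this reduces to ordinary cross-entropy, which makes it easy to compare against a baseline loss.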
format Article
id doaj-art-a78ddf8f05a44086b4b0f4110e976bcb
institution Kabale University
issn 0954-0091
1360-0494
language English
publishDate 2022-12-01
publisher Taylor & Francis Group
record_format Article
series Connection Science
spelling doaj-art-a78ddf8f05a44086b4b0f4110e976bcb; 2025-08-20T03:39:36Z; eng; Taylor & Francis Group; Connection Science; 0954-0091; 1360-0494; 2022-12-01; 34; 1; 2577; 2596; 10.1080/09540091.2022.2134843; 2134843; Research on the Uyghur morphological segmentation model with an attention mechanism; Gulinigeer Abudouwaili (Xinjiang University); Kahaerjing Abiderexiti (Xinjiang University); Yunfei Shen (Xinjiang University); Aishan Wumaier (Xinjiang University); http://dx.doi.org/10.1080/09540091.2022.2134843; attention mechanism; label imbalance; morphological segmentation; tagging scheme
title Research on the Uyghur morphological segmentation model with an attention mechanism
topic attention mechanism
label imbalance
morphological segmentation
tagging scheme
url http://dx.doi.org/10.1080/09540091.2022.2134843