Towards Malay named entity recognition: an open-source dataset and a multi-task framework

Bibliographic Details
Main Authors: Yingwen Fu, Nankai Lin, Zhihe Yang, Shengyi Jiang
Format: Article
Language: English
Published: Taylor & Francis Group 2023-12-01
Series: Connection Science
Online Access: http://dx.doi.org/10.1080/09540091.2022.2159014
collection DOAJ
description Named entity recognition (NER) is a key component of many natural language processing (NLP) applications. Most advanced NER research, however, has not been widely applied to low-resource languages such as Malay, because the methods require large amounts of labelled data. In this paper, we present a system for building a Malay NER dataset (MS-NER) of 20,146 sentences from labelled datasets of homologous languages, refined through iterative optimisation. Additionally, we propose a multi-task framework, MTBR, that integrates boundary information more effectively for NER. Specifically, boundary detection is treated as an auxiliary task, and an enhanced Bidirectional Revision module with a gated ignoring mechanism performs conditional label transfer, which reduces error propagation from the auxiliary task. We conduct extensive experiments on Malay, Indonesian, and English. The results show that MTBR achieves competitive performance and tends to outperform multiple baselines. The constructed dataset and model will be made publicly available as a new, reliable benchmark for Malay NER.
id doaj-art-0ba6bc85e83d45f29ae6a035d964e73f
institution DOAJ
issn 0954-0091
1360-0494
affiliations Yingwen Fu: Guangdong University of Foreign Studies
Nankai Lin: Guangdong University of Technology
Zhihe Yang: Guangdong University of Foreign Studies
Shengyi Jiang: Guangdong University of Foreign Studies
volume 35
issue 1
article_number 2159014
record_updated 2025-08-20T03:17:14Z
topic malay
named entity recognition
dataset
multi-task learning
bi-revision