OdNER: NER resource creation and system development for low-resource Odia language

This work aims to enhance the usability of natural language processing (NLP) based systems for the low-resource Odia language by focusing on the development of an effective named entity recognition (NER) system. NLP applications rely heavily on NER to extract relevant information from massive amounts o...

Bibliographic Details
Main Authors:	Tusarkanta Dalai, Anupam Das, Tapas Kumar Mishra, Pankaj Kumar Sa
Format:	Article
Language:	English
Published:	Elsevier 2025-06-01
Series:	Natural Language Processing Journal
Subjects:	Named entity recognition (NER); Conditional random field (CRF); Deep learning; Transformer; Low-resource language
Online Access:	http://www.sciencedirect.com/science/article/pii/S2949719125000159

_version_	1850114232360632320
author	Tusarkanta Dalai Anupam Das Tapas Kumar Mishra Pankaj Kumar Sa
author_facet	Tusarkanta Dalai Anupam Das Tapas Kumar Mishra Pankaj Kumar Sa
author_sort	Tusarkanta Dalai
collection	DOAJ
description	This work aims to enhance the usability of natural language processing (NLP) based systems for the low-resource Odia language by focusing on the development of an effective named entity recognition (NER) system. NLP applications rely heavily on NER to extract relevant information from massive amounts of unstructured text. NER is the task of identifying the named entities in a given text and classifying them into a set of predetermined categories. The NER task has already produced strong results in English as well as in a number of other European languages. However, because of a lack of supporting tools and resources, it has not yet been thoroughly investigated in Indian languages, particularly Odia. Recently, approaches based on machine learning (ML) and deep learning (DL) have demonstrated exceptional performance on NLP tasks. Moreover, transformer models, particularly masked language models (MLM), have demonstrated remarkable efficacy in the NER task; nevertheless, these methods generally require large volumes of annotated data. Unfortunately, we could not find any open-source NER corpus for the Odia language. The purpose of this research is to compile OdNER, an NER dataset with quality baselines for the low-resource Odia language. The OdNER corpus contains 48,000 sentences comprising 6,71,354 tokens and 98,116 named entities annotated with 12 tags. To establish the quality of our corpus, we use conditional random field (CRF) and BiLSTM models as baselines. To demonstrate the efficacy of our dataset, we conduct a comparative evaluation of several transformer-based multilingual language models (IndicBERT, MuRIL, XLM-R) on the sequence labeling task for NER. With the pre-trained XLM-R multilingual model, our dataset achieves a maximum F1 score of 90.48%. When it comes to Odia NER, no other work comes close to matching the quality and quantity of ours. We anticipate that this work will make substantial progress toward the development of NLP tasks for the Odia language.
format	Article
id	doaj-art-7b07fd4dbc0644899d02a0fd2a52bd55
institution	OA Journals
issn	2949-7191
language	English
publishDate	2025-06-01
publisher	Elsevier
record_format	Article
series	Natural Language Processing Journal
spelling	doaj-art-7b07fd4dbc0644899d02a0fd2a52bd55 | 2025-08-20T02:36:58Z | eng | Elsevier | Natural Language Processing Journal | 2949-7191 | 2025-06-01 | 11 | 100139 | 10.1016/j.nlp.2025.100139 | OdNER: NER resource creation and system development for low-resource Odia language | Tusarkanta Dalai [0] | Anupam Das [1] | Tapas Kumar Mishra [2] | Pankaj Kumar Sa [3] | [0] Department of Computer Science and Engineering, NIT Rourkela, Rourkela, Odisha, India; Department of Computer Science and Engineering, Siksha ‘O’ Anusandhan Deemed to be University, Bhubaneswar, Odisha, India; Corresponding author at: Department of Computer Science and Engineering, NIT Rourkela, Rourkela, Odisha, India. | [1] Department of Computer Science and Engineering, NIT Rourkela, Rourkela, Odisha, India | [2] Department of Computer Science and Engineering, NIT Rourkela, Rourkela, Odisha, India | [3] Department of Computer Science and Engineering, NIT Rourkela, Rourkela, Odisha, India | http://www.sciencedirect.com/science/article/pii/S2949719125000159 | Named entity recognition (NER) | Conditional random field (CRF) | Deep learning | Transformer | Low-resource language
spellingShingle	Tusarkanta Dalai Anupam Das Tapas Kumar Mishra Pankaj Kumar Sa OdNER: NER resource creation and system development for low-resource Odia language Natural Language Processing Journal Named entity recognition (NER) Conditional random field (CRF) Deep learning Transformer Low-resource language
title	OdNER: NER resource creation and system development for low-resource Odia language
title_full	OdNER: NER resource creation and system development for low-resource Odia language
title_fullStr	OdNER: NER resource creation and system development for low-resource Odia language
title_full_unstemmed	OdNER: NER resource creation and system development for low-resource Odia language
title_short	OdNER: NER resource creation and system development for low-resource Odia language
title_sort	odner ner resource creation and system development for low resource odia language
topic	Named entity recognition (NER); Conditional random field (CRF); Deep learning; Transformer; Low-resource language
url	http://www.sciencedirect.com/science/article/pii/S2949719125000159
work_keys_str_mv	AT tusarkantadalai odnernerresourcecreationandsystemdevelopmentforlowresourceodialanguage AT anupamdas odnernerresourcecreationandsystemdevelopmentforlowresourceodialanguage AT tapaskumarmishra odnernerresourcecreationandsystemdevelopmentforlowresourceodialanguage AT pankajkumarsa odnernerresourcecreationandsystemdevelopmentforlowresourceodialanguage

Similar Items