Knowledge-based Word Tokenization System for Urdu

Word tokenization, a foundational step in natural language processing (NLP), is critical for tasks like part-of-speech tagging, named entity recognition, and parsing, as well as various independent NLP applications. In our tech-driven era, the exponential growth of textual data on the World Wide Web...

Bibliographic Details
Main Authors: Asif Khan, Khairullah Khan, Wahab Khan, Sadiq Nawaz Khan, Rafiul Haq
Format: Article
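Language: English
Published: MMU Press 2024-06-01
Series: Journal of Informatics and Web Engineering
Subjects: natural language processing (NLP); Urdu language processing (ULP); forward maximum matching (FMM); reverse maximum matching (RMM); part-of-speech tagging (POS)
Online Access: https://journals.mmupress.com/index.php/jiwe/article/view/902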
author Asif Khan
Khairullah Khan
Wahab Khan
Sadiq Nawaz Khan
Rafiul Haq
collection DOAJ
description Word tokenization, a foundational step in natural language processing (NLP), is critical for tasks like part-of-speech tagging, named entity recognition, and parsing, as well as various independent NLP applications. In our tech-driven era, the exponential growth of textual data on the World Wide Web demands sophisticated tools for effective processing. Urdu, spoken widely across the globe, is experiencing a surge in speakers, emphasizing the urgent need for advanced Urdu Language Processing (ULP) systems. However, Urdu, labeled as a low-resource language, presents unique challenges due to its distinct writing style, the absence of capitalization features, and the prevalence of compound words. This study introduces a novel knowledge-based word tokenization system tailored for Urdu. Central to this system is a maximum matching model with forward and reverse variants, setting it apart from conventional approaches. The novelty of our system lies in its holistic approach, integrating knowledge-based techniques, dual-variant maximum matching, and heightened adaptability to low-resource language challenges compared to traditional machine learning (ML) approaches. Significantly, our system eliminates the need for a features file and pre-labelled datasets, streamlining the tokenization process. To evaluate the proposed model's efficacy, an analysis was conducted on a dataset comprising 100 sentences with 5,000 Urdu words, yielding an accuracy of 97%. This research contributes to Urdu language processing by providing a solution to the complexities posed by the unique linguistic attributes of Urdu tokenization.
format Article
id doaj-art-58ed2e6b819f418abe54cacf27fc4396
institution OA Journals
issn 2821-370X
language English
publishDate 2024-06-01
publisher MMU Press
record_format Article
series Journal of Informatics and Web Engineering
spelling doaj-art-58ed2e6b819f418abe54cacf27fc4396
2025-08-20T02:30:59Z
eng
MMU Press
Journal of Informatics and Web Engineering
2821-370X
2024-06-01
32869710.33093/jiwe.2024.3.2.6901
Knowledge-based Word Tokenization System for Urdu
Asif Khan (University of Science & Technology Bannu, Pakistan)
Khairullah Khan (University of Science & Technology Bannu, Pakistan)
Wahab Khan (International Islamic University Islamabad, Pakistan; https://orcid.org/0000-0002-5694-0419)
Sadiq Nawaz Khan (University of Science & Technology Bannu, Pakistan)
Rafiul Haq (Tianjin University, China)
https://journals.mmupress.com/index.php/jiwe/article/view/902
natural language processing (nlp)
urdu language processing (ulp)
forward maximum matching (fmm)
reverse maximum matching (rmm)
part-of-speech tagging (pos)
title Knowledge-based Word Tokenization System for Urdu
topic natural language processing (nlp)
urdu language processing (ulp)
forward maximum matching (fmm)
reverse maximum matching (rmm)
part-of-speech tagging (pos)
url https://journals.mmupress.com/index.php/jiwe/article/view/902
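The description above names forward and reverse maximum matching as the core of the system but gives no implementation detail. The following is a minimal Python sketch of the general technique only, not the authors' code. It assumes the input is a run of characters with unreliable or missing spaces, that a word list is available as a set, and that any character not covered by the word list is emitted as a single-character token. The function names, the tiny lexicon, and the sample text are all hypothetical.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    if max_len is None:
        # Longest lexicon entry bounds the window; fall back to 1 for an empty lexicon.
        max_len = max([len(w) for w in lexicon] + [1])
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: an unknown single character becomes its own token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def reverse_max_match(text, lexicon, max_len=None):
    """Greedy right-to-left segmentation: take the longest lexicon word ending at each position."""
    if max_len is None:
        max_len = max([len(w) for w in lexicon] + [1])
    tokens = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text backwards
    return tokens


# Hypothetical toy data: four Urdu words written without spaces, plus a shorter
# lexicon entry that is a prefix of the first word.
WORDS = ["پاکستان", "ایک", "ملک", "ہے"]
LEXICON = set(WORDS) | {"پاک"}
TEXT = "".join(WORDS)

print(forward_max_match(TEXT, LEXICON))
print(reverse_max_match(TEXT, LEXICON))

# The two directions can disagree on ambiguous input:
TOY = {"a", "ab", "bc", "c"}
print(forward_max_match("abc", TOY))  # ['ab', 'c']
print(reverse_max_match("abc", TOY))  # ['a', 'bc']
```

On the Urdu sample both directions recover the four original words, because the longest match is preferred over the shorter prefix entry. The last two lines show why running both variants is useful: when they disagree, a tokenizer needs some rule to choose between the two segmentations, and the description does not say which rule the published system uses.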