Knowledge-based Word Tokenization System for Urdu
Word tokenization, a foundational step in natural language processing (NLP), is critical for tasks like part-of-speech tagging, named entity recognition, and parsing, as well as various independent NLP applications. In our tech-driven era, the exponential growth of textual data on the World Wide Web...
| Main Authors: | Asif Khan, Khairullah Khan, Wahab Khan, Sadiq Nawaz Khan, Rafiul Haq |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MMU Press, 2024-06-01 |
| Series: | Journal of Informatics and Web Engineering |
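| Subjects: | natural language processing (nlp); urdu language processing (ulp); forward maximum matching (fmm); reverse maximum matching (rmm); part-of-speech tagging (pos) |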
| Online Access: | https://journals.mmupress.com/index.php/jiwe/article/view/902 |
| _version_ | 1850136985429082112 |
|---|---|
| author | Asif Khan; Khairullah Khan; Wahab Khan; Sadiq Nawaz Khan; Rafiul Haq |
| author_sort | Asif Khan |
| collection | DOAJ |
| description | Word tokenization, a foundational step in natural language processing (NLP), is critical for tasks such as part-of-speech tagging, named entity recognition, and parsing, as well as for various independent NLP applications. The exponential growth of textual data on the World Wide Web demands sophisticated tools for effective processing. Urdu, spoken widely across the globe, presents unique challenges due to its distinct writing style, the absence of capitalization, and the prevalence of compound words. This study introduces a novel knowledge-based word tokenization system tailored for Urdu. Central to this system is a maximum matching model with forward and reverse variants, which sets it apart from conventional approaches. The novelty of our system lies in its holistic approach: it integrates knowledge-based techniques and dual-variant maximum matching, and it is adaptable to low-resource languages, underscoring the urgent need for advanced Urdu Language Processing (ULP) systems. As a low-resource language, Urdu poses challenges for traditional machine learning (ML) approaches. Significantly, our system requires neither a features file nor pre-labelled datasets, which streamlines the tokenization process. To evaluate the proposed model, an analysis was conducted on a dataset of 100 sentences containing 5,000 Urdu words, yielding an accuracy of 97%. This research contributes to Urdu language processing by offering a solution to the complexities posed by the unique linguistic attributes of Urdu. |
| format | Article |
| id | doaj-art-58ed2e6b819f418abe54cacf27fc4396 |
| institution | OA Journals |
| issn | 2821-370X |
| language | English |
| publishDate | 2024-06-01 |
| publisher | MMU Press |
| record_format | Article |
| series | Journal of Informatics and Web Engineering |
| spelling | doaj-art-58ed2e6b819f418abe54cacf27fc4396; 2025-08-20T02:30:59Z; eng; MMU Press; Journal of Informatics and Web Engineering; 2821-370X; 2024-06-01; 32869710.33093/jiwe.2024.3.2.6901; Knowledge-based Word Tokenization System for Urdu; Asif Khan (University of Science & Technology Bannu, Pakistan); Khairullah Khan (University of Science & Technology Bannu, Pakistan); Wahab Khan (International Islamic University Islamabad, Pakistan; https://orcid.org/0000-0002-5694-0419); Sadiq Nawaz Khan (University of Science & Technology Bannu, Pakistan); Rafiul Haq (Tianjin University, China); https://journals.mmupress.com/index.php/jiwe/article/view/902; natural language processing (nlp); urdu language processing (ulp); forward maximum matching (fmm); reverse maximum matching (rmm); part-of-speech tagging (pos) |
| title | Knowledge-based Word Tokenization System for Urdu |
| topic | natural language processing (nlp); urdu language processing (ulp); forward maximum matching (fmm); reverse maximum matching (rmm); part-of-speech tagging (pos) |
| url | https://journals.mmupress.com/index.php/jiwe/article/view/902 |
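
The description above names forward and reverse maximum matching (FMM and RMM) as the core of the tokenizer. The following is a minimal sketch of the general technique, not the authors' implementation: the function names, the single-character fallback for unknown text, and the toy lexicon are all assumptions made for illustration, and a Latin-script string stands in for unsegmented Urdu text.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Scan left to right, always taking the longest lexicon entry."""
    max_len = max_len or max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumption: an unknown single character becomes its own token.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, lexicon, max_len=None):
    """Scan right to left, always taking the longest lexicon entry."""
    max_len = max_len or max(map(len, lexicon))
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


# Hypothetical toy lexicon; a real system would load an Urdu word list.
LEXICON = {"the", "theta", "table", "bled", "down", "own", "there"}
TEXT = "thetabledownthere"

print(forward_max_match(TEXT, LEXICON))  # ['theta', 'bled', 'own', 'there']
print(reverse_max_match(TEXT, LEXICON))  # ['the', 'table', 'down', 'there']
```

The two scans disagree on this input, which is why a system may run both variants: a disagreement flags an ambiguous span that needs a further rule to resolve. How the described system combines the two results is not stated in this record.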