Cword2vec: a novel morphological rule-based word embedding approach for Urdu text sentiment analysis

Bibliographic Details
Main Authors: Saquib Khushhal, Abdul Majid, Syed Ali Abass, Rabia Riaz, Mohammad Babar, Shafiq Ahmad
Format: Article
Language: English
Published: PeerJ Inc. 2025-07-01
Series: PeerJ Computer Science
Online Access: https://peerj.com/articles/cs-2937.pdf
Description
Summary: Word embeddings are essential to natural language processing tasks because they encode each word’s syntactic and semantic information. They have been developed extensively for many languages around the world, such as English. Urdu, however, still needs more attention from the research community despite its large number of speakers, approximately 231.3 million. Urdu is difficult to process because its word boundaries are unspecified: it does not employ delimiters between words. A compound word, a type of multiword expression, is more complex still, consisting of several strings or independent base words. Traditionally, compound words are identified during word segmentation using bigram or trigram approaches. The challenge with these techniques is that they do not produce meaningful words. This study instead uses morphological rule-based compound words in Urdu text documents. For text representation, a self-trained morphological rule-based compound word embedding (Cword2vec), based on the word2vec model, is proposed for Urdu text sentiment analysis. Its performance was evaluated for sentiment analysis using four well-known deep learning models: long short-term memory (LSTM), bidirectional LSTM (BiLSTM), convolutional neural networks (CNN), and convolutional LSTM (C-LSTM). The morphological rule-based compound words were also compared with traditional compound word identification techniques, namely bigrams and trigrams. Regardless of the classification model, word embeddings built on the proposed morphological rule-based compound words outperformed those built on bigrams and trigrams in precision, recall, F1 score, and accuracy.
ISSN: 2376-5992
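
Illustration: the contrast the summary draws between rule-identified compound words and plain bigrams can be sketched in a few lines of Python. This is only a minimal sketch, not the authors' method: the article's morphological rules are not given in this record, so a small hand-written set of compounds (`compounds`, hypothetical) stands in for their output, and the romanized tokens are illustrative. The functions `merge_compounds` and `all_bigrams` are names invented for this example.

```python
from typing import List, Set, Tuple


def merge_compounds(
    tokens: List[str],
    compounds: Set[Tuple[str, ...]],
    joiner: str = "_",
) -> List[str]:
    """Greedily merge the longest known compound starting at each position.

    `compounds` is a hypothetical stand-in for the output of morphological
    rules; each entry is a tuple of base words that form one compound.
    """
    max_len = max((len(c) for c in compounds), default=1)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to length 2.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in compounds:
                merged.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


def all_bigrams(tokens: List[str], joiner: str = "_") -> List[str]:
    """Return every adjacent pair, meaningful or not (the traditional approach)."""
    return [joiner.join(pair) for pair in zip(tokens, tokens[1:])]


# Illustrative romanized sentence; "khoob soorat" is treated as one compound.
tokens = ["yeh", "khoob", "soorat", "shehar", "hai"]
compounds = {("khoob", "soorat")}

rule_based = merge_compounds(tokens, compounds)
bigram_based = all_bigrams(tokens)
```

Here `rule_based` keeps one meaningful compound token alongside the ordinary words, whereas `bigram_based` also contains pairs such as `soorat_shehar` that do not correspond to a real word. Token sequences like `rule_based` would then serve as the training input for a word2vec-style model.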