SBU-WSD-Corpus: A Sense Annotated Corpus for Persian All-words Word Sense Disambiguation

Word Sense Disambiguation (WSD) is a long-standing task in Natural Language Processing (NLP) that aims to automatically identify the most relevant meaning of each word in a given context. Standard WSD test collections are an important prerequisite for developing and evalua...

Bibliographic Details
Main Authors: Hossein Rouhizadeh, Mehrnoush Shamsfard, Vahide Tajalli
Format: Article
Language: English
Published: University of Science and Culture 2022-07-01
Series: International Journal of Web Research
Subjects: word sense disambiguation; wsd corpus; all-words wsd; persian language processing
Online Access: https://ijwr.usc.ac.ir/article_165861_444adb5f7ab0eada122fc44b4722aef8.pdf
author Hossein Rouhizadeh
Mehrnoush Shamsfard
Vahide Tajalli
collection DOAJ
description Word Sense Disambiguation (WSD) is a long-standing task in Natural Language Processing (NLP) that aims to automatically identify the most relevant meaning of each word in a given context. Standard WSD test collections are an important prerequisite for developing and evaluating WSD systems in a given language. Although many WSD test collections have been developed for a variety of languages, no standard All-words WSD benchmark is available for Persian. In this paper, we address this gap by introducing SBU-WSD-Corpus, the first standard test set for the Persian All-words WSD task. SBU-WSD-Corpus is manually annotated with senses from the Persian WordNet (FarsNet) sense inventory; three annotators performed the annotation using SAMP, a sense annotation tool based on the FarsNet lexical graph. The corpus consists of 19 Persian documents from different domains, such as Sports, Science, and Arts. It includes 5892 content words of Persian running text and 3371 manually sense-annotated words (2073 nouns, 566 verbs, 610 adjectives, and 122 adverbs). To provide baselines for future studies on the Persian All-words WSD task, we evaluate several WSD models on SBU-WSD-Corpus.
format Article
id doaj-art-ad599d0100b34b9f8df68986f8b1316e
institution Kabale University
issn 2645-4343
language English
publishDate 2022-07-01
publisher University of Science and Culture
record_format Article
series International Journal of Web Research
spelling doaj-art-ad599d0100b34b9f8df68986f8b1316e
2025-01-04T10:24:02Z
eng
University of Science and Culture
International Journal of Web Research
2645-4343
2022-07-01
5 2 77 85
10.22133/ijwr.2023.354098.1128
SBU-WSD-Corpus: A Sense Annotated Corpus for Persian All-words Word Sense Disambiguation
Hossein Rouhizadeh (0)
Mehrnoush Shamsfard (1) https://orcid.org/0000-0002-7027-7529
Vahide Tajalli (2)
(0) Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
(1) Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
(2) University of Tehran, Tehran, Iran
https://ijwr.usc.ac.ir/article_165861_444adb5f7ab0eada122fc44b4722aef8.pdf
word sense disambiguation
wsd corpus
all-words wsd
persian language processing
title SBU-WSD-Corpus: A Sense Annotated Corpus for Persian All-words Word Sense Disambiguation
topic word sense disambiguation
wsd corpus
all-words wsd
persian language processing
url https://ijwr.usc.ac.ir/article_165861_444adb5f7ab0eada122fc44b4722aef8.pdf