Annotated Lexicon for Sentiment Analysis in the Bosnian Language

The paper presents the first sentiment-annotated lexicon of the Bosnian language. The annotation process and methodology are presented along with a usability study, which concentrates on language coverage. The composition of the starting base was done by translating the Slovenian annotated lexicon a...

Bibliographic Details
Main Authors: Sead Jahić, Jernej Vičič
Format: Article
Language: English
Published: University of Ljubljana Press (Založba Univerze v Ljubljani) 2023-12-01
Series: Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
Subjects: Bosnian lexicon; corpus; sentiment analysis; AnAwords; stopwords; log-likelihood
Online Access: https://journals.uni-lj.si/slovenscina2/article/view/11717
collection DOAJ
description The paper presents the first sentiment-annotated lexicon of the Bosnian language. It describes the annotation process and methodology, together with a usability study that concentrates on language coverage. The starting base was built by translating the Slovenian annotated lexicon and then manually checking the translations and annotations. Language coverage was measured on two reference corpora. Bosnian is still considered a low-resource language: a reference corpus composed of automatically crawled web pages is available, but the authors had difficulty finding any corpus with a clear time frame for its texts. A corpus of contemporary texts was therefore constructed by collecting news articles from several Bosnian web portals. Two language coverage methods were used in the experiment. The first used a frequency list of all words extracted from the two reference Bosnian corpora; the second did not use frequencies as the main factor in counting. With the first method, the computed coverage was 19.24% for the first corpus and 28.05% for the second. With the second method, coverage was 2.34% for the first corpus and 6.98% for the second. The resulting language coverage is comparable to the state of the art in the field. The usability of the lexicon has already been demonstrated in a Twitter-based comparison.
id doaj-art-6d1a237bf5a04de6815c5699fb9050d9
institution Kabale University
issn 2335-2736
spelling Annotated Lexicon for Sentiment Analysis in the Bosnian Language
Sead Jahić (University of Primorska, Faculty of Mathematics, Natural Science and Information Technologies, Koper, Slovenia)
Jernej Vičič (University of Primorska, Faculty of Mathematics, Natural Science and Information Technologies, Koper; Research Centre of the Slovenian Academy of Sciences and Arts, Fran Ramovš Institute of the Slovenian Language, Ljubljana, Slovenia)
Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave, vol. 11, no. 2, 2023-12-01
doi 10.4312/slo2.0.2023.2.59-83
topic Bosnian lexicon
corpus
sentiment analysis
AnAwords
stopwords
log-likelihood
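
The description mentions two ways of measuring language coverage: one based on a word frequency list and one that does not weight by frequency. The record does not give the exact formulas, so the following is only a minimal sketch under an assumed reading: the first measure is taken as the share of running words (tokens) found in the lexicon, and the second as the share of distinct word forms (types) found in it. The function names, the toy sentence, and the two-word lexicon are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter


def token_coverage(freqs, lexicon):
    """Frequency-weighted coverage: share of running words found in the lexicon."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    covered = sum(count for word, count in freqs.items() if word in lexicon)
    return covered / total


def type_coverage(freqs, lexicon):
    """Unweighted coverage: share of distinct word forms found in the lexicon."""
    if not freqs:
        return 0.0
    covered = sum(1 for word in freqs if word in lexicon)
    return covered / len(freqs)


# Hypothetical toy corpus and lexicon, for illustration only.
corpus_tokens = "dobar dan i dobar film ali loš kraj".split()
lexicon = {"dobar", "loš"}

# Frequency list of all words in the corpus.
freqs = Counter(corpus_tokens)

print(f"token coverage: {token_coverage(freqs, lexicon):.2%}")
print(f"type coverage:  {type_coverage(freqs, lexicon):.2%}")
```

Under this reading, frequency-weighted coverage is higher than unweighted coverage whenever the lexicon entries are among the more frequent words of the corpus, which matches the direction of the figures reported in the description (19.24% and 28.05% versus 2.34% and 6.98%).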