Annotated Lexicon for Sentiment Analysis in the Bosnian Language
The paper presents the first sentiment-annotated lexicon of the Bosnian language. The annotation process and methodology are presented along with a usability study, which concentrates on language coverage. The composition of the starting base was done by translating the Slovenian annotated lexicon a...
| Main Authors: | Sead Jahić, Jernej Vičič |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | University of Ljubljana Press (Založba Univerze v Ljubljani), 2023-12-01 |
| Series: | Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave |
| Subjects: | Bosnian lexicon; corpus; sentiment analysis; AnAwords; stopwords; log-likelihood |
| Online Access: | https://journals.uni-lj.si/slovenscina2/article/view/11717 |
| _version_ | 1849319321337069568 |
|---|---|
| author | Sead Jahić; Jernej Vičič |
| collection | DOAJ |
| description | The paper presents the first sentiment-annotated lexicon of the Bosnian language. The annotation process and methodology are presented along with a usability study, which concentrates on language coverage. The starting base was composed by translating the Slovenian annotated lexicon and then manually checking the translations and annotations. The language coverage was observed using two reference corpora. The Bosnian language is still considered a low-resource language. A reference corpus composed of automatically crawled web pages is available for the Bosnian language, but the authors had difficulty sourcing any corpora with a clear time frame for the text contained therein. A corpus of contemporary texts was therefore constructed by collecting news articles from several Bosnian web portals. Two language coverage methods were used in this experiment. The first used a frequency list of all words extracted from the two reference Bosnian language corpora, and the second ignored the frequencies as the main factor in counting. The coverage computed with the first method was 19.24% for the first corpus and 28.05% for the second corpus. The second method yielded 2.34% coverage for the first corpus and 6.98% for the second corpus. The results show a language coverage that is comparable to the state of the art in the field. The usability of the lexicon has already been demonstrated in a Twitter-based comparison. |
| format | Article |
| id | doaj-art-6d1a237bf5a04de6815c5699fb9050d9 |
| institution | Kabale University |
| issn | 2335-2736 |
| language | English |
| publishDate | 2023-12-01 |
| publisher | University of Ljubljana Press (Založba Univerze v Ljubljani) |
| record_format | Article |
| series | Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave |
| spelling | doaj-art-6d1a237bf5a04de6815c5699fb9050d9; 2025-08-20T03:50:31Z; eng; University of Ljubljana Press (Založba Univerze v Ljubljani); Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave; 2335-2736; 2023-12-01; 11; 2; 10.4312/slo2.0.2023.2.59-83; 18085; Annotated Lexicon for Sentiment Analysis in the Bosnian Language; Sead Jahić (University of Primorska, Faculty of Mathematics, Natural Science and Information Technologies, Koper, Slovenia); Jernej Vičič (University of Primorska, Faculty of Mathematics, Natural Science and Information Technologies, Koper; Research Centre of the Slovenian Academy of Sciences and Arts, Fran Ramovš Institute of the Slovenian Language, Ljubljana, Slovenia); https://journals.uni-lj.si/slovenscina2/article/view/11717; Bosnian lexicon; corpus; sentiment analysis; AnAwords; stopwords; log-likelihood |
| title | Annotated Lexicon for Sentiment Analysis in the Bosnian Language |
| topic | Bosnian lexicon; corpus; sentiment analysis; AnAwords; stopwords; log-likelihood |
| url | https://journals.uni-lj.si/slovenscina2/article/view/11717 |
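
The description distinguishes two coverage measures: one based on a word frequency list and one that ignores frequencies. A minimal sketch of one plausible reading follows, assuming the first measure is the share of corpus tokens found in the lexicon and the second is the share of distinct word forms found in it. The function names `token_coverage` and `type_coverage` and the toy word list are hypothetical illustrations, not taken from the paper, and the authors' actual procedure (for example, lemmatization or stopword handling) may differ.

```python
from collections import Counter


def token_coverage(freqs, lexicon):
    """Frequency-weighted coverage in percent (assumed reading of method 1).

    freqs maps each corpus word to its number of occurrences.
    """
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    covered = sum(count for word, count in freqs.items() if word in lexicon)
    return 100.0 * covered / total


def type_coverage(freqs, lexicon):
    """Coverage over distinct words in percent (assumed reading of method 2).

    Each distinct corpus word counts once, regardless of its frequency.
    """
    if not freqs:
        return 0.0
    covered = sum(1 for word in freqs if word in lexicon)
    return 100.0 * covered / len(freqs)


# Hypothetical toy data, not drawn from the corpora described in the record.
corpus_tokens = "dobar dan i dobar film ali loš kraj".split()
freqs = Counter(corpus_tokens)
lexicon = {"dobar", "loš"}

print(f"token coverage: {token_coverage(freqs, lexicon):.2f}%")
print(f"type coverage:  {type_coverage(freqs, lexicon):.2f}%")
```

Under this reading, frequent lexicon words raise the first measure much more than the second, which would be consistent with the gap between the reported figures (19.24% versus 2.34% for the first corpus).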