Comparison of Language Models for English-Latvian Semantic Search

Bibliographic Details
Main Authors: Kucheravy Artem, Jēkabsons Gints
Format: Article
Language: English
Published: Sciendo 2025-01-01
Series: Applied Computer Systems
Subjects: embeddings, language models, semantic search, sentence-transformers
Online Access: https://doi.org/10.2478/acss-2025-0004
author Kucheravy Artem
Jēkabsons Gints
collection DOAJ
description In this study, ten language models are compared in an English-Latvian semantic information retrieval setting, where the indexed document collection is written in English and the query documents are written in Latvian. No comparable research has yet been done for the Latvian language. A dataset of 77,736 pairs of articles from Latvian and English Wikipedia was created, transformed into embedding vectors, and used for retrieval experiments with brute-force search, the Hierarchical Navigable Small World (HNSW) method, and the Inverted File Indexing (IVF) method. The LaBSE language model achieved the best performance for short texts, while a version of Sentence-BERT and E5-large performed best for long texts.
format Article
id doaj-art-880887befac5438fa611525e451122c7
institution Kabale University
issn 2255-8691
language English
publishDate 2025-01-01
publisher Sciendo
record_format Article
series Applied Computer Systems
spelling doaj-art-880887befac5438fa611525e451122c7 | 2025-02-10T13:25:18Z | eng | Sciendo | Applied Computer Systems | ISSN 2255-8691 | 2025-01-01 | vol. 30, no. 1, pp. 34-39 | doi:10.2478/acss-2025-0004
Comparison of Language Models for English-Latvian Semantic Search
Kucheravy Artem (Institute of Applied Computer Systems, Riga Technical University, Riga, Latvia)
Jēkabsons Gints (Institute of Applied Computer Systems, Riga Technical University, Riga, Latvia)
Keywords: embeddings; language models; semantic search; sentence-transformers
title Comparison of Language Models for English-Latvian Semantic Search
topic embeddings
language models
semantic search
sentence-transformers
url https://doi.org/10.2478/acss-2025-0004
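
Illustrative note: the record's description mentions retrieval over embedding vectors with brute-force search. The sketch below shows what brute-force cross-lingual retrieval could look like in principle, assuming cosine similarity as the ranking score (the record does not state which similarity measure the authors used). The function names, document identifiers, and the three-dimensional vectors are hypothetical placeholders, not outputs of LaBSE, Sentence-BERT, or E5, and this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def brute_force_search(query_vec, index_vecs, k=1):
    """Score the query against every indexed vector and return the top-k ids."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "English index": made-up vectors standing in for article embeddings.
english_index = {
    "en:Riga": [0.9, 0.1, 0.0],
    "en:Physics": [0.0, 0.8, 0.6],
    "en:Music": [0.1, 0.0, 0.95],
}

# Toy "Latvian query": a made-up vector standing in for an embedded Latvian article.
latvian_query = [0.85, 0.2, 0.05]

top = brute_force_search(latvian_query, english_index, k=2)
print(top)
```

Brute-force search compares the query with every indexed vector, so it is exact but scales linearly with collection size; approximate methods such as HNSW and IVF, also named in the description, trade some accuracy for speed.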