Comparison of Language Models for English-Latvian Semantic Search

Bibliographic Details
Main Authors: Kucheravy Artem, Jēkabsons Gints
Format: Article
Language: English
Published: Sciendo 2025-01-01
Series: Applied Computer Systems
Online Access: https://doi.org/10.2478/acss-2025-0004
Description
Summary: In this study, ten language models are explored and compared in an English-Latvian semantic information retrieval setting, where the indexed collection of documents is written in English while the query documents are written in Latvian. To date, no similar research has been done for the Latvian language. A dataset of 77,736 pairs of articles from Latvian and English Wikipedia was created, transformed into embedding vectors, and used for retrieval experiments with brute-force search, the Hierarchical Navigable Small World (HNSW) method, and the Inverted File Indexing (IVF) method. The LaBSE language model achieved the best performance for short texts, while a version of Sentence-BERT and E5-large performed best for long texts.
ISSN:2255-8691
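
The brute-force retrieval setting described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation. Cosine similarity as the scoring function and recall@k as the metric are assumptions, since the record states neither. The three-dimensional vectors and the document identifiers are hypothetical toy values standing in for real model embeddings of paired Wikipedia articles.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def brute_force_search(query_vec, index_vecs, k=1):
    """Score the query against every indexed vector and return the top-k ids."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


def recall_at_k(queries, index_vecs, k=1):
    """Fraction of queries whose paired document appears in the top-k results."""
    hits = sum(
        1 for true_id, vec in queries
        if true_id in brute_force_search(vec, index_vecs, k)
    )
    return hits / len(queries)


# Hypothetical toy embeddings: English articles form the index ...
english_index = {
    "en_riga": [0.9, 0.1, 0.0],
    "en_physics": [0.0, 0.8, 0.6],
    "en_music": [0.1, 0.0, 0.95],
}

# ... and each Latvian query is labeled with the id of its paired English article.
latvian_queries = [
    ("en_riga", [0.85, 0.2, 0.05]),
    ("en_physics", [0.05, 0.7, 0.7]),
    ("en_music", [0.0, 0.1, 0.9]),
]

if __name__ == "__main__":
    print(brute_force_search(latvian_queries[0][1], english_index, k=2))
    print(recall_at_k(latvian_queries, english_index, k=1))
```

Brute-force search compares the query with every indexed vector, so it is exact but scales linearly with the collection size. HNSW and IVF are approximate alternatives that trade some accuracy for speed.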