A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian

Bibliographic Details
Main Authors: Ermelinda Oro, Francesco Maria Granata, Massimo Ruffolo
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Big Data and Cognitive Computing
Online Access: https://www.mdpi.com/2504-2289/9/5/141
Description
Summary: This study presents a comprehensive evaluation of embedding techniques and large language models (LLMs) for information retrieval (IR) and question answering (QA) across languages, focusing on English and Italian. We address a significant research gap by providing empirical evidence of model performance across linguistic boundaries. We evaluate 12 embedding models on diverse IR datasets, including Italian SQuAD and DICE for Italian, and SciFact, ArguAna, and NFCorpus for English. We assess four LLMs (GPT-4o, Llama-3.1 8B, Mistral-Nemo, and Gemma-2b) on QA tasks within a retrieval-augmented generation (RAG) pipeline, using the SQuAD, CovidQA, and NarrativeQA datasets and including cross-lingual scenarios. The results show that multilingual models are more competitive than language-specific ones. The embed-multilingual-v3.0 model achieves top nDCG@10 scores of 0.90 for English and 0.86 for Italian. In the QA evaluation, Mistral-Nemo demonstrates superior answer relevance (0.91–1.0) while maintaining strong groundedness (0.64–0.78). Our analysis reveals three key findings: (1) multilingual embedding models effectively bridge performance gaps between English and Italian, though performance consistency decreases in specialized domains; (2) model size does not consistently predict performance; and (3) all evaluated QA systems exhibit a critical trade-off between answer relevance and factual groundedness. Our evaluation framework combines traditional metrics with innovative LLM-based assessment techniques. It establishes new benchmarks for multilingual language technologies while providing actionable insights for real-world IR and QA system deployment.
ISSN: 2504-2289
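
Note on nDCG@10: the summary reports retrieval quality as nDCG@10. For readers unfamiliar with the metric, the following is a minimal sketch of how it is commonly computed for a single query, assuming the linear-gain formulation (gain divided by log2 of rank + 1). The function names, document IDs, and toy relevance labels are hypothetical and are not taken from the article; the authors' exact evaluation tooling is not stated in this record.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_ids: document IDs in the order the retriever returned them.
    relevance:  graded relevance per judged document; unjudged documents count as 0.
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    # The ideal ranking places the most relevant judged documents first.
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Hypothetical toy query: two relevant documents, retrieved at ranks 1 and 3.
toy_relevance = {"d1": 1.0, "d3": 1.0}
toy_ranking = ["d1", "d2", "d3", "d4"]
toy_score = ndcg_at_k(toy_ranking, toy_relevance, k=10)
print(round(toy_score, 4))
```

A dataset-level score such as the 0.90 and 0.86 quoted above would typically be the mean of this per-query value over all queries in a dataset.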