A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian

Bibliographic Details
Main Authors: Ermelinda Oro, Francesco Maria Granata, Massimo Ruffolo
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Big Data and Cognitive Computing
Subjects:
Online Access: https://www.mdpi.com/2504-2289/9/5/141
_version_ 1849327445874835456
author Ermelinda Oro
Francesco Maria Granata
Massimo Ruffolo
author_facet Ermelinda Oro
Francesco Maria Granata
Massimo Ruffolo
author_sort Ermelinda Oro
collection DOAJ
description This study presents a comprehensive evaluation of embedding techniques and large language models (LLMs) for Information Retrieval (IR) and question answering (QA) across languages, focusing on English and Italian. We address a significant research gap by providing empirical evidence of model performance across linguistic boundaries. We evaluate 12 embedding models on diverse IR datasets, including Italian SQuAD and DICE, English SciFact, ArguAna, and NFCorpus. We assess four LLMs (GPT-4o, Llama-3.1 8B, Mistral-Nemo, and Gemma-2b) for QA tasks within a retrieval-augmented generation (RAG) pipeline. We evaluate them on SQuAD, CovidQA, and NarrativeQA datasets, including cross-lingual scenarios. The results show that multilingual models perform more competitively than language-specific ones. The embed-multilingual-v3.0 model achieves top nDCG@10 scores of 0.90 for English and 0.86 for Italian. In QA evaluation, Mistral-Nemo demonstrates superior answer relevance (0.91–1.0) while maintaining strong groundedness (0.64–0.78). Our analysis reveals three key findings: (1) multilingual embedding models effectively bridge performance gaps between English and Italian, though performance consistency decreases in specialized domains, (2) model size does not consistently predict performance, and (3) all evaluated QA systems exhibit a critical trade-off between answer relevance and factual groundedness. Our evaluation framework combines traditional metrics with innovative LLM-based assessment techniques. It establishes new benchmarks for multilingual language technologies while providing actionable insights for real-world IR and QA system deployment.
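The abstract's headline retrieval results (nDCG@10 of 0.90 for English, 0.86 for Italian) use the standard normalized discounted cumulative gain metric. A minimal sketch of how nDCG@10 is computed, using hypothetical graded relevance labels rather than any data from the study:

```python
import math

def dcg(relevances, k=10):
    # Discounted cumulative gain over the top-k ranked results:
    # each relevance grade is discounted by log2(rank + 2),
    # so rank 0 divides by log2(2) = 1.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=10):
    # Normalize by the DCG of the ideal (descending) ordering,
    # so a perfect ranking scores exactly 1.0.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades of retrieved documents, in rank order.
print(round(ndcg([3, 2, 3, 0, 1, 2], k=10), 3))
```

A score near 1.0 means highly relevant documents sit at the top of the ranking; swapping a relevant document toward the tail lowers the score because of the logarithmic rank discount.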
format Article
id doaj-art-369bf0fcc9874e8c8c5ec8b963271905
institution Kabale University
issn 2504-2289
language English
publishDate 2025-05-01
publisher MDPI AG
record_format Article
series Big Data and Cognitive Computing
spelling doaj-art-369bf0fcc9874e8c8c5ec8b963271905
datestamp 2025-08-20T03:47:53Z
doi 10.3390/bdcc9050141
affiliation Ermelinda Oro: ICAR-CNR, Institute for High Performance Computing and Networking, National Research Council, 87036 Rende, CS, Italy
affiliation Francesco Maria Granata: Altilia srl, TechNest, Incubator of the University of Calabria, 87036 Rende, CS, Italy
affiliation Massimo Ruffolo: Altilia srl, TechNest, Incubator of the University of Calabria, 87036 Rende, CS, Italy
spellingShingle Ermelinda Oro
Francesco Maria Granata
Massimo Ruffolo
A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
Big Data and Cognitive Computing
multilingual embeddings
information retrieval
large language models
natural language processing
question answering
retrieval-augmented generation
title A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
title_full A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
title_fullStr A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
title_full_unstemmed A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
title_short A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
title_sort comprehensive evaluation of embedding models and llms for ir and qa across english and italian
topic multilingual embeddings
information retrieval
large language models
natural language processing
question answering
retrieval-augmented generation
url https://www.mdpi.com/2504-2289/9/5/141
work_keys_str_mv AT ermelindaoro acomprehensiveevaluationofembeddingmodelsandllmsforirandqaacrossenglishanditalian
AT francescomariagranata acomprehensiveevaluationofembeddingmodelsandllmsforirandqaacrossenglishanditalian
AT massimoruffolo acomprehensiveevaluationofembeddingmodelsandllmsforirandqaacrossenglishanditalian
AT ermelindaoro comprehensiveevaluationofembeddingmodelsandllmsforirandqaacrossenglishanditalian
AT francescomariagranata comprehensiveevaluationofembeddingmodelsandllmsforirandqaacrossenglishanditalian
AT massimoruffolo comprehensiveevaluationofembeddingmodelsandllmsforirandqaacrossenglishanditalian