A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
This study presents a comprehensive evaluation of embedding techniques and large language models (LLMs) for Information Retrieval (IR) and question answering (QA) across languages, focusing on English and Italian. We address a significant research gap by providing empirical evidence of model performance across linguistic boundaries. We evaluate 12 embedding models on diverse IR datasets, including Italian SQuAD and DICE, English SciFact, ArguAna, and NFCorpus. We assess four LLMs (GPT4o, LLama-3.1 8B, Mistral-Nemo, and Gemma-2b) for QA tasks within a retrieval-augmented generation (RAG) pipeline, evaluating them on SQuAD, CovidQA, and NarrativeQA, including cross-lingual scenarios.
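The retrieve-then-generate flow that the abstract calls a RAG pipeline can be sketched minimally as below. The `embed` function here is a toy bag-of-words stand-in for the neural embedding models the study evaluates (e.g. embed-multilingual-v3.0), and the final generation step is stubbed out; corpus contents and function names are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
import math

def embed(text):
    # Placeholder embedding: a bag-of-words term-count vector.
    # The study uses neural embedding models; this is only a stand-in.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank passages by similarity to the query and keep the top-k.
    q = embed(query)
    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def answer(query, corpus):
    # In a full RAG pipeline an LLM would generate an answer conditioned on
    # the retrieved context; here we just return the retrieved context.
    return {"question": query, "context": retrieve(query, corpus)}

corpus = [
    "Rome is the capital of Italy.",
    "nDCG is a ranking quality metric.",
    "Italian SQuAD is a QA dataset.",
]
print(answer("What is the capital of Italy?", corpus))
```

The metrics the paper reports on this kind of pipeline, answer relevance and groundedness, would be computed on the generated answer against the question and the retrieved context, respectively.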
| Main Authors: | Ermelinda Oro, Francesco Maria Granata, Massimo Ruffolo |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | Big Data and Cognitive Computing |
| Subjects: | multilingual embeddings; information retrieval; large language models; natural language processing; question answering; retrieval-augmented generation |
| Online Access: | https://www.mdpi.com/2504-2289/9/5/141 |
| author | Ermelinda Oro; Francesco Maria Granata; Massimo Ruffolo |
|---|---|
| collection | DOAJ |
| description | This study presents a comprehensive evaluation of embedding techniques and large language models (LLMs) for Information Retrieval (IR) and question answering (QA) across languages, focusing on English and Italian. We address a significant research gap by providing empirical evidence of model performance across linguistic boundaries. We evaluate 12 embedding models on diverse IR datasets, including Italian SQuAD and DICE, English SciFact, ArguAna, and NFCorpus. We assess four LLMs (GPT4o, LLama-3.1 8B, Mistral-Nemo, and Gemma-2b) for QA tasks within a retrieval-augmented generation (RAG) pipeline. We evaluate them on SQuAD, CovidQA, and NarrativeQA datasets, including cross-lingual scenarios. The results show multilingual models perform more competitively than language-specific ones. The embed-multilingual-v3.0 model achieves top nDCG@10 scores of 0.90 for English and 0.86 for Italian. In QA evaluation, Mistral-Nemo demonstrates superior answer relevance (0.91–1.0) while maintaining strong groundedness (0.64–0.78). Our analysis reveals three key findings: (1) multilingual embedding models effectively bridge performance gaps between English and Italian, though performance consistency decreases in specialized domains, (2) model size does not consistently predict performance, and (3) all evaluated QA systems exhibit a critical trade-off between answer relevance and factual groundedness. Our evaluation framework combines traditional metrics with innovative LLM-based assessment techniques. It establishes new benchmarks for multilingual language technologies while providing actionable insights for real-world IR and QA system deployment. |
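The description reports retrieval quality as nDCG@10 (0.90 for English, 0.86 for Italian). A minimal self-contained sketch of that metric, with an illustrative graded-relevance ranking rather than any of the paper's actual data, looks like this:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results, in rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: graded relevance of retrieved documents in the order returned.
# The third and fourth documents are swapped relative to the ideal ranking.
print(round(ndcg_at_k([3, 2, 0, 1], k=10), 3))  # → 0.985
```

A score of 1.0 means the ranking already lists documents in ideal relevance order, which is why values such as 0.90 indicate near-ideal retrieval.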
| format | Article |
| id | doaj-art-369bf0fcc9874e8c8c5ec8b963271905 |
| institution | Kabale University |
| issn | 2504-2289 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Big Data and Cognitive Computing |
| doi | 10.3390/bdcc9050141 |
| citation | Big Data and Cognitive Computing, Vol. 9, No. 5, Art. 141 (2025-05-01) |
| affiliations | Ermelinda Oro: ICAR-CNR—Institute for High Performance Computing and Networking, National Research Council, 87036 Rende, CS, Italy; Francesco Maria Granata: Altilia srl, TechNest—Incubator of the University of Calabria, 87036 Rende, CS, Italy; Massimo Ruffolo: Altilia srl, TechNest—Incubator of the University of Calabria, 87036 Rende, CS, Italy |
| title | A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian |
| topic | multilingual embeddings; information retrieval; large language models; natural language processing; question answering; retrieval-augmented generation |
| url | https://www.mdpi.com/2504-2289/9/5/141 |