Establishing vocabulary tests as a benchmark for evaluating large language models.

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.

Bibliographic Details
Main Authors: Gonzalo Martínez, Javier Conde, Elena Merino-Gómez, Beatriz Bermúdez-Margaretto, José Alberto Hernández, Pedro Reviriego, Marc Brysbaert
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2024-01-01
Series: PLoS ONE
ISSN: 1932-6203
Citation: PLoS ONE 19(12): e0308259 (2024)
DOI: 10.1371/journal.pone.0308259
Online Access: https://doi.org/10.1371/journal.pone.0308259
Collection: DOAJ (record id doaj-art-46b001e89d794ba3b4abd61afdec9f0c)
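
Illustrative sketch: the abstract describes automatically generating and administering vocabulary tests to LLMs. The Python sketch below shows one way a single multiple-choice vocabulary item could be formatted as a prompt and scored; it is not the authors' code, and the example word, answer options, and the ask_llm callable are hypothetical placeholders standing in for whatever test formats and model APIs the article actually uses.

from typing import Callable

def build_prompt(word: str, options: list[str]) -> str:
    """Format a single multiple-choice vocabulary test item as a prompt."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"Which option is closest in meaning to the word '{word}'?\n"
        f"{lettered}\n"
        "Answer with a single letter."
    )

def score_item(word: str, options: list[str], correct: str,
               ask_llm: Callable[[str], str]) -> bool:
    """Return True if the model's single-letter answer matches `correct`."""
    answer = ask_llm(build_prompt(word, options)).strip().upper()
    return answer[:1] == correct

if __name__ == "__main__":
    # Stand-in "model" that always answers 'A'; replace with a real LLM call.
    dummy_model = lambda prompt: "A"
    word, options, correct = "lucid", ["clear", "heavy", "distant", "sour"], "A"
    print(score_item(word, options, correct, ask_llm=dummy_model))  # -> True

Keeping the model call behind a plain callable lets the same item set be run unchanged against each of the LLMs being compared, which matches the cross-model, cross-language comparison the abstract describes.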