Establishing vocabulary tests as a benchmark for evaluating large language models.
Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.
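To give a concrete sense of how a vocabulary test can be administered to an LLM automatically, the sketch below poses a single multiple-choice item through an OpenAI-compatible chat API and scores the answer. It is an illustrative assumption only: the item, prompt wording, scoring rule, and model name are hypothetical and do not reproduce the test formats or protocol used in the article.

```python
# Minimal sketch: pose one multiple-choice vocabulary item to an LLM and
# check the answer. Item, prompt, and model name are hypothetical examples,
# not the paper's actual test material or procedure.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_vocabulary_item(word: str, options: list[str], correct: str,
                        model: str = "gpt-4o-mini") -> bool:
    """Ask the model which option is closest in meaning to `word`."""
    letters = "ABCD"
    choices = "\n".join(f"{letters[i]}) {opt}" for i, opt in enumerate(options))
    prompt = (
        f"Which option is closest in meaning to '{word}'?\n"
        f"{choices}\n"
        "Answer with a single letter."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = reply.choices[0].message.content.strip().upper()[:1]
    return answer == letters[options.index(correct)]

# Hypothetical example item: does the model know 'lucid'?
print(ask_vocabulary_item("lucid", ["clear", "heavy", "angry", "distant"], "clear"))
```

Aggregating accuracy over many such items, generated for each target language, is one straightforward way to turn a vocabulary test into a repeatable benchmark score.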
Saved in:
| Main Authors: | Gonzalo Martínez, Javier Conde, Elena Merino-Gómez, Beatriz Bermúdez-Margaretto, José Alberto Hernández, Pedro Reviriego, Marc Brysbaert |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2024-01-01 |
| Series: | PLoS ONE (Vol. 19, Iss. 12, Article e0308259) |
| ISSN: | 1932-6203 |
| Online Access: | https://doi.org/10.1371/journal.pone.0308259 |