Evaluation of event plausibility recognition in Large (Vision)-Language Models

Transformer-based Language Models (LMs) achieve outstanding performance on various tasks but still exhibit limitations in recognizing common world events, i.e., in their generalized event knowledge (GEK), particularly when this requires referential information or real-world experience. Assuming that visual knowledge in vision-language models (VLMs) provides additional referential information, this paper tests their ability to leverage implicit event knowledge to acquire robust and generalizable representations of agent-patient interactions, assessing their capacity to distinguish between plausible and implausible events. The analysis was conducted on models of varying sizes and architectures.

In the evaluation, the performance of unimodal and multimodal models of various sizes was compared on the task of recognizing the plausibility of minimal sentence pairs. Our analysis suggests several findings: 1) decoder-only models tend to outperform encoder-only ones; 2) model size has a minor impact: although larger models perform better in absolute terms, the differences between 7B- and 13B-parameter models are not significant for this particular task; 3) while smaller encoder-only VLMs consistently fall short of their LLM counterparts, larger ones perform similarly or slightly better; 4) all models perform worse on the more challenging sentences; 5) adding corresponding images to the textual stimuli affects the accuracy of some models. These findings open avenues for further analyses of the inner workings of VLMs and their ability to model event knowledge with and without visual inputs.
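To illustrate the kind of evaluation the abstract describes, the sketch below scores a plausible/implausible minimal sentence pair with a causal (decoder-only) LM by comparing total sentence log-probabilities. This is not the authors' code: the model (`gpt2`, a small stand-in for the larger models evaluated in the paper) and the example sentence pair are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# rank a plausible/implausible minimal pair by sentence log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper evaluates larger (V)LMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Total log-probability of a sentence under the causal LM."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood over the predicted
    # tokens (sequence length minus one, due to the internal shift);
    # multiply back to recover the summed log-probability.
    n_pred = enc["input_ids"].size(1) - 1
    return -out.loss.item() * n_pred

# A hypothetical minimal pair differing only in the agent-patient roles:
plausible = "The chef chopped the onion."
implausible = "The onion chopped the chef."

lp_plaus = sentence_log_prob(plausible)
lp_implaus = sentence_log_prob(implausible)
# The model "recognizes" plausibility if it assigns the plausible
# sentence a higher probability than its implausible counterpart.
print(f"plausible: {lp_plaus:.2f}, implausible: {lp_implaus:.2f}, "
      f"correct: {lp_plaus > lp_implaus}")
```

Accuracy over a set of such pairs is the fraction on which the model prefers the plausible member; encoder-only models would instead require a masked-scoring variant (e.g., pseudo-log-likelihood).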

Bibliographic Details
Main Authors: Maria Cassese, Alessandro Bondielli, Alessandro Lenci
Format: Article
Language: English
Published: Accademia University Press, 2024-12-01
Series: IJCoL
ISSN: 2499-4553
Online Access: https://journals.openedition.org/ijcol/1422