A publicly available benchmark for assessing large language models’ ability to predict how humans balance self-interest and the interest of others
Abstract Large language models (LLMs) hold enormous potential to assist humans in decision-making processes, from everyday to high-stakes scenarios. However, because many human decisions carry social implications, a necessary prerequisite for LLMs to be reliable assistants is that they can capture how humans balance self-interest and the interest of others. Here we introduce a novel, publicly available benchmark to test LLMs' ability to predict how humans balance monetary self-interest and the interest of others. The benchmark consists of 106 textual instructions from dictator game experiments conducted with human participants from 12 countries, along with a compendium of actual human behavior in each experiment. We evaluate four advanced chatbots against this benchmark and find that none of them meets it. In particular, only GPT-4 and GPT-4o (not Bard or Bing) correctly capture qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. Nonetheless, GPT-4 and GPT-4o consistently underestimate self-interest while overestimating altruistic behavior. In sum, this article introduces a publicly available resource for testing the capacity of LLMs to estimate human other-regarding preferences in economic decisions and reveals an "optimistic bias" in current versions of GPT.
| Main Authors: | Valerio Capraro, Roberto Di Paolo, Veronica Pizziol |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Subjects: | Generative artificial intelligence; Human behavior; Economic games; Dictator game; Altruism |
| Online Access: | https://doi.org/10.1038/s41598-025-01715-7 |
| _version_ | 1849334354140987392 |
|---|---|
| author | Valerio Capraro; Roberto Di Paolo; Veronica Pizziol |
| author_sort | Valerio Capraro |
| collection | DOAJ |
| description | Abstract Large language models (LLMs) hold enormous potential to assist humans in decision-making processes, from everyday to high-stakes scenarios. However, because many human decisions carry social implications, a necessary prerequisite for LLMs to be reliable assistants is that they can capture how humans balance self-interest and the interest of others. Here we introduce a novel, publicly available benchmark to test LLMs' ability to predict how humans balance monetary self-interest and the interest of others. The benchmark consists of 106 textual instructions from dictator game experiments conducted with human participants from 12 countries, along with a compendium of actual human behavior in each experiment. We evaluate four advanced chatbots against this benchmark and find that none of them meets it. In particular, only GPT-4 and GPT-4o (not Bard or Bing) correctly capture qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. Nonetheless, GPT-4 and GPT-4o consistently underestimate self-interest while overestimating altruistic behavior. In sum, this article introduces a publicly available resource for testing the capacity of LLMs to estimate human other-regarding preferences in economic decisions and reveals an "optimistic bias" in current versions of GPT. |
| format | Article |
| id | doaj-art-e1d1ca36370443169d604d82fb1b6357 |
| institution | Kabale University |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-e1d1ca36370443169d604d82fb1b6357 (2025-08-20T03:45:35Z); eng; Nature Portfolio; Scientific Reports; ISSN 2045-2322; 2025-07-01; vol. 15, iss. 1, pp. 1–11; https://doi.org/10.1038/s41598-025-01715-7; A publicly available benchmark for assessing large language models’ ability to predict how humans balance self-interest and the interest of others; Valerio Capraro (Department of Psychology, University of Milan Bicocca); Roberto Di Paolo (Department of Economics and Management, University of Parma); Veronica Pizziol (Department of Economics, University of Bologna); keywords: Generative artificial intelligence, Human behavior, Economic games, Dictator game, Altruism |
| title | A publicly available benchmark for assessing large language models’ ability to predict how humans balance self-interest and the interest of others |
| topic | Generative artificial intelligence; Human behavior; Economic games; Dictator game; Altruism |
| url | https://doi.org/10.1038/s41598-025-01715-7 |