Generalization bias in large language model summarization of scientific research

Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26–73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (odds ratio = 4.85, 95% CI [3.06, 7.70], p < 0.001). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.
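For readers unfamiliar with the reported statistic: the odds ratio in the abstract compares the odds of an overgeneralized conclusion appearing in an LLM-generated summary against the odds of one appearing in a human-authored summary. A minimal sketch of how an odds ratio and its Wald 95% confidence interval are computed from a 2×2 table (the counts below are hypothetical illustrations, not the study's data):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table:
    a = LLM summaries with an overgeneralization, b = without;
    c = human summaries with an overgeneralization, d = without."""
    or_ = (a * d) / (b * c)
    # Standard error of the log odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper

# Hypothetical counts, for illustration only
or_, lower, upper = odds_ratio_ci(60, 40, 20, 80)
print(f"OR = {or_:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

A CI that excludes 1.0 (as in the reported [3.06, 7.70]) indicates the difference between the two groups is statistically significant at the 5% level.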

Bibliographic Details
Main Authors: Uwe Peters; Benjamin Chin-Yee
Format: Article
Language: English
Published: The Royal Society, 2025-04-01
Series: Royal Society Open Science
Subjects: large language models; algorithmic bias; science communication; overgeneralization
Online Access: https://royalsocietypublishing.org/doi/10.1098/rsos.241776
Record ID: doaj-art-0cfbd3437abe465db828960ce8924d2e
Collection: DOAJ
ISSN: 2054-5703
Volume 12, issue 4 (2025-04-01); DOI: 10.1098/rsos.241776
Author affiliations: Uwe Peters, Utrecht University, Utrecht, The Netherlands; Benjamin Chin-Yee, Western University, London, Canada