Generalization bias in large language model summarization of scientific research

Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26–73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (odds ratio = 4.85, 95% CI [3.06, 7.70], p < 0.001). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.
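For readers unfamiliar with the reported statistic: the odds ratio in the abstract compares the odds of an overgeneralized conclusion appearing in an LLM-generated summary against the odds of one appearing in a human-authored summary. A minimal sketch of how an odds ratio and its Wald 95% confidence interval are computed from a 2×2 table (the counts below are hypothetical illustrations, not the study's data):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table:
    a = LLM summaries with an overgeneralization, b = without;
    c = human summaries with an overgeneralization, d = without."""
    or_ = (a * d) / (b * c)
    # Standard error of the log odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper

# Hypothetical counts, for illustration only
or_, lower, upper = odds_ratio_ci(60, 40, 20, 80)
print(f"OR = {or_:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

A CI that excludes 1.0 (as in the reported [3.06, 7.70]) indicates the difference between the two groups is statistically significant at the 5% level.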

Bibliographic Details
Main Authors: Uwe Peters; Benjamin Chin-Yee
Format: Article
Language: English
Published: The Royal Society, 2025-04-01
Series: Royal Society Open Science
Subjects: large language models; algorithmic bias; science communication; overgeneralization
Online Access: https://royalsocietypublishing.org/doi/10.1098/rsos.241776
Record ID: doaj-art-0cfbd3437abe465db828960ce8924d2e
Collection: DOAJ
ISSN: 2054-5703
Volume 12, issue 4 (2025-04-01); DOI: 10.1098/rsos.241776
Author affiliations: Uwe Peters, Utrecht University, Utrecht, The Netherlands; Benjamin Chin-Yee, Western University, London, Canada