Large Language Model–Assisted Risk-of-Bias Assessment in Randomized Controlled Trials Using the Revised Risk-of-Bias Tool: Usability Study

Background: The revised Risk-of-Bias tool (RoB2) overcomes the limitations of its predecessor but introduces new implementation challenges. Studies demonstrate low interrater reliability and substantial time requirements for RoB2 implementation. Large language models (LLMs) may assist in RoB2 implementation, although their effectiveness remains uncertain.

Objective: This study aims to evaluate the accuracy of LLMs in RoB2 assessments to explore their potential as research assistants for bias evaluation.

Methods: We systematically searched the Cochrane Library (through October 2023) for reviews using RoB2, categorized by the effect of interest (adhering to or assignment to intervention). From 86 eligible reviews of randomized controlled trials (covering 1399 RCTs), we randomly selected 46 RCTs (23 per category). Three experienced reviewers independently assessed all 46 RCTs using RoB2, recording the assessment time for each trial; reviewer judgments were reconciled through consensus. Six RCTs (3 from each category) were randomly selected for prompt development and optimization. The remaining 40 trials established the internal validation standard, while Cochrane Review judgments served as external validation. Primary outcomes were extracted as reported in the corresponding Cochrane Reviews. We calculated accuracy rates, Cohen κ, and time differentials.

Results: We identified significant differences between Cochrane and reviewer judgments, particularly in domains 1, 4, and 5, likely due to different standards in assessing randomization and blinding. Among the 20 trials focusing on adhering, 18 were classified as "High risk" by Cochrane Reviews and 19 by reviewers, while assignment-focused RCTs showed a more heterogeneous risk distribution. Compared with Cochrane Reviews, LLMs demonstrated accuracy rates of 57.5% and 70% for the overall (assignment) and overall (adhering) judgments, respectively; compared with reviewer judgments, their accuracy rates for these judgments were 65% and 70%. The average accuracy rates for the remaining 6 domains were 65.2% (95% CI 57.6-72.7) against Cochrane Reviews and 74.2% (95% CI 64.7-83.9) against reviewers. At the signaling-question level, LLMs achieved 83.2% average accuracy (95% CI 77.5-88.9), exceeding 70% for all questions except 2.4 (assignment), 2.5 (assignment), 3.3, and 3.4. When domain judgments were derived from LLM-generated signaling-question answers using the RoB2 algorithm rather than taken directly from the LLM, accuracy improved substantially for domain 2 (adhering; from 55% to 95%) and overall (adhering; from 70% to 90%). LLMs demonstrated high consistency between iterations (average 85.2%, 95% CI 85.15-88.79) and completed assessments in 1.9 minutes versus 31.5 minutes for human reviewers (mean difference 29.6 minutes, 95% CI 25.6-33.6).

Conclusions: LLMs achieved commendable accuracy when guided by structured prompts, particularly when processing methodological details through structured reasoning. While not replacing human assessment, LLMs demonstrate strong potential for assisting RoB2 evaluations. Larger studies with improved prompting could enhance performance.
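A key result above is that domain-level judgments derived from the LLM's signaling-question answers via the RoB2 mapping algorithm were more accurate than judgments the LLM produced directly. The Python sketch below illustrates that derivation step for domain 1 (randomization process); the mapping is a simplified illustration of the published flowchart, not a verbatim reproduction, and the function name and answer codes are assumptions for this example.

# Sketch: derive a RoB2 domain-level judgment from signaling-question
# answers instead of asking the model for the domain judgment directly.
# Simplified illustration of the domain 1 (randomization process)
# algorithm, not the published flowchart verbatim.

YES = {"Y", "PY"}   # yes / probably yes
NO = {"N", "PN"}    # no / probably no

def domain1_judgment(q1_1: str, q1_2: str, q1_3: str) -> str:
    """Map answers to signaling questions 1.1-1.3 to a risk judgment.

    q1_1: was the allocation sequence random?
    q1_2: was the allocation sequence concealed?
    q1_3: did baseline imbalances suggest a problem with randomization?
    Answers are Y/PY/N/PN/NI (no information).
    """
    if q1_2 in NO:                       # allocation not concealed
        return "High risk"
    if q1_1 in YES and q1_2 in YES:      # adequate generation + concealment
        return "Low risk" if q1_3 not in YES else "Some concerns"
    # generation or concealment unclear: escalate if imbalance seen
    return "High risk" if q1_3 in YES else "Some concerns"

# Example: random sequence, concealment unclear, no baseline imbalance.
print(domain1_judgment("Y", "NI", "PN"))  # -> "Some concerns"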

Bibliographic Details
Main Authors: Jiajie Huang, Honghao Lai, Weilong Zhao, Danni Xia, Chunyang Bai, Mingyao Sun, Jianing Liu, Jiayi Liu, Bei Pan, Jinhui Tian, Long Ge
Format: Article
Language: English
Published: JMIR Publications, 2025-06-01
Series: Journal of Medical Internet Research
ISSN: 1438-8871
DOI: 10.2196/70450
Online Access: https://www.jmir.org/2025/1/e70450
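The agreement statistics reported in the abstract (accuracy rates, Cohen κ, and the reviewer-versus-LLM time differential) are standard measures; the following self-contained sketch shows one way they might be computed. All data here are hypothetical, and cohen_kappa_score is scikit-learn's chance-corrected agreement implementation.

# Sketch: agreement statistics on hypothetical data.
# Requires scikit-learn and scipy.
from sklearn.metrics import cohen_kappa_score
from scipy import stats

# Hypothetical RoB2 overall judgments for 5 trials.
llm      = ["High risk", "Low risk", "Some concerns", "High risk", "Low risk"]
reviewer = ["High risk", "Low risk", "High risk",     "High risk", "Low risk"]

accuracy = sum(a == b for a, b in zip(llm, reviewer)) / len(llm)
kappa = cohen_kappa_score(llm, reviewer)  # chance-corrected agreement

# Hypothetical per-trial assessment times (minutes).
llm_min      = [1.8, 2.0, 1.9, 1.7, 2.1]
reviewer_min = [30.0, 33.5, 29.0, 35.0, 31.0]
diffs = [r - l for r, l in zip(reviewer_min, llm_min)]
mean_diff = sum(diffs) / len(diffs)
# 95% CI for the mean paired difference via the t distribution.
ci = stats.t.interval(0.95, len(diffs) - 1,
                      loc=mean_diff, scale=stats.sem(diffs))

print(f"accuracy={accuracy:.0%}, kappa={kappa:.2f}, "
      f"mean time saved={mean_diff:.1f} min, 95% CI={ci[0]:.1f}-{ci[1]:.1f}")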