Systematic benchmarking of large Language models in programmed cell death-oriented gastric cancer research: a comparative analysis of DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5

Bibliographic Details
Main Authors: Yuheng Li, Jiaqi Dong, Dongdong Liu, Yuqing Huang, Yan Jiang, Liangchao Chen, Qiming Gong
Format: Article
Language: English
Published: Springer 2025-07-01
Series: Discover Oncology
Subjects: Gastric cancer; Programmed cell death; Apoptosis; Necroptosis; Model reliability
Online Access: https://doi.org/10.1007/s12672-025-02911-7
_version_ 1849238626813083648
author Yuheng Li
Jiaqi Dong
Dongdong Liu
Yuqing Huang
Yan Jiang
Liangchao Chen
Qiming Gong
collection DOAJ
description Abstract
Objectives: We intended to compare three language models (DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5) regarding their ability to address programmed cell death mechanisms in gastric cancer. We aimed to establish which model most accurately reflects clinical standards and guidelines.
Methods: Fifty-five frequently posed questions and twenty guideline-oriented queries on cell death processes were collected from recognized gastroenterology and oncology resources. Each model received every question individually. Six independent specialists, each from a distinct hospital, rated responses from 1 to 10, and their scores were summed to a 60-point total. Answers achieving totals above 45 were classified as “good,” 30 to 45 as “moderate,” and below 30 as “poor.” Models that delivered “poor” replies received additional prompts for self‑correction, and revised answers underwent the same review.
Results: DeepSeek‑R1 showed higher total scores than the other two models in almost every topic, particularly for surgical protocols and multimodal therapies. Claude 3.5 ranked second, displaying mostly coherent coverage but occasionally omitting recent guideline updates. DeepSeek‑V3 had difficulty with intricate guideline-based material. In “poor” responses, DeepSeek‑R1 corrected errors markedly, shifting to a “good” rating upon re-evaluation, while DeepSeek‑V3 improved only marginally. Claude 3.5 consistently moved its “poor” answers up into the moderate range.
Conclusion: DeepSeek‑R1 demonstrated the strongest performance for clinical content linked to programmed cell death in gastric cancer, while Claude 3.5 performed moderately well. DeepSeek‑V3 proved adequate for more basic queries but lacked sufficient detail for advanced guideline-based scenarios. These findings highlight the potential and limitations of such automated models when applied in complex oncologic contexts.
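The scoring rule described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `classify_response` and `needs_self_correction` are hypothetical, and the example only encodes what the abstract states (six ratings from 1 to 10, summed to a total out of 60, with bands of above 45, 30 to 45, and below 30).

```python
from typing import Sequence, Tuple


def classify_response(ratings: Sequence[int]) -> Tuple[int, str]:
    """Sum six 1-10 specialist ratings and map the total to a band.

    Bands follow the abstract: above 45 is "good", 30 to 45 is
    "moderate", below 30 is "poor". Names here are illustrative only.
    """
    if len(ratings) != 6:
        raise ValueError("expected one rating from each of six specialists")
    if any(not 1 <= r <= 10 for r in ratings):
        raise ValueError("each rating must be between 1 and 10")

    total = sum(ratings)  # maximum possible total is 60
    if total > 45:
        label = "good"
    elif total >= 30:
        label = "moderate"
    else:
        label = "poor"
    return total, label


def needs_self_correction(ratings: Sequence[int]) -> bool:
    """Per the abstract, only "poor" answers were re-prompted."""
    return classify_response(ratings)[1] == "poor"


if __name__ == "__main__":
    # Made-up ratings, purely to exercise the thresholds.
    print(classify_response([8, 8, 8, 8, 8, 8]))
    print(classify_response([7, 8, 7, 8, 7, 8]))
    print(needs_self_correction([4, 5, 5, 5, 5, 5]))
```

Note that a total of exactly 45 falls in the "moderate" band under this reading, since the abstract reserves "good" for totals above 45.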
format Article
id doaj-art-78551a1712484fcd8eb4c83191d43ef2
institution Kabale University
issn 2730-6011
language English
publishDate 2025-07-01
publisher Springer
record_format Article
series Discover Oncology
spelling doaj-art-78551a1712484fcd8eb4c83191d43ef2
2025-08-20T04:01:34Z
eng
Springer
Discover Oncology
2730-6011
2025-07-01
Volume 16, issue 1, pages 1-14
10.1007/s12672-025-02911-7
Systematic benchmarking of large Language models in programmed cell death-oriented gastric cancer research: a comparative analysis of DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5
Yuheng Li (Department of General Surgery, Luzhou People’s Hospital)
Jiaqi Dong (Department of Gastroenterology, Deyang People’s Hospital)
Dongdong Liu (Department of Laboratory Medicine, Deyang People’s Hospital)
Yuqing Huang (Department of Nephrology, Affiliated Hospital of Youjiang Medical University for Nationalities)
Yan Jiang (Department of Nephrology, Affiliated Hospital of Youjiang Medical University for Nationalities)
Liangchao Chen (Department of Oncology, Xichong People’s Hospital)
Qiming Gong (Department of Nephrology, Affiliated Hospital of Youjiang Medical University for Nationalities)
https://doi.org/10.1007/s12672-025-02911-7
Gastric cancer
Programmed cell death
Apoptosis
Necroptosis
Model reliability
title Systematic benchmarking of large Language models in programmed cell death-oriented gastric cancer research: a comparative analysis of DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5
topic Gastric cancer
Programmed cell death
Apoptosis
Necroptosis
Model reliability
url https://doi.org/10.1007/s12672-025-02911-7