Systematic benchmarking of large Language models in programmed cell death-oriented gastric cancer research: a comparative analysis of DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5
Abstract. Objectives: We intended to compare three language models (DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5) regarding their ability to address programmed cell death mechanisms in gastric cancer. We aimed to establish which model most accurately reflects clinical standards and guidelines. Methods: Fif...
| Main Authors: | Yuheng Li, Jiaqi Dong, Dongdong Liu, Yuqing Huang, Yan Jiang, Liangchao Chen, Qiming Gong |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-07-01 |
| Series: | Discover Oncology |
| Subjects: | Gastric cancer; Programmed cell death; Apoptosis; Necroptosis; Model reliability |
| Online Access: | https://doi.org/10.1007/s12672-025-02911-7 |
| _version_ | 1849238626813083648 |
|---|---|
| author | Yuheng Li; Jiaqi Dong; Dongdong Liu; Yuqing Huang; Yan Jiang; Liangchao Chen; Qiming Gong |
| author_facet | Yuheng Li; Jiaqi Dong; Dongdong Liu; Yuqing Huang; Yan Jiang; Liangchao Chen; Qiming Gong |
| author_sort | Yuheng Li |
| collection | DOAJ |
| description | Abstract. Objectives: We intended to compare three language models (DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5) regarding their ability to address programmed cell death mechanisms in gastric cancer. We aimed to establish which model most accurately reflects clinical standards and guidelines. Methods: Fifty-five frequently posed questions and twenty guideline-oriented queries on cell death processes were collected from recognized gastroenterology and oncology resources. Each model received every question individually. Six independent specialists, each from a distinct hospital, rated responses from 1 to 10, and their scores were summed to a 60-point total. Answers achieving totals above 45 were classified as “good,” 30 to 45 as “moderate,” and below 30 as “poor.” Models that delivered “poor” replies received additional prompts for self‑correction, and revised answers underwent the same review. Results: DeepSeek‑R1 showed higher total scores than the other two models in almost every topic, particularly for surgical protocols and multimodal therapies. Claude 3.5 ranked second, displaying mostly coherent coverage but occasionally omitting recent guideline updates. DeepSeek‑V3 had difficulty with intricate guideline-based material. In “poor” responses, DeepSeek‑R1 corrected errors markedly, shifting to a “good” rating upon re-evaluation, while DeepSeek‑V3 improved only marginally. Claude 3.5 consistently moved its “poor” answers up into the moderate range. Conclusion: DeepSeek‑R1 demonstrated the strongest performance for clinical content linked to programmed cell death in gastric cancer, while Claude 3.5 performed moderately well. DeepSeek‑V3 proved adequate for more basic queries but lacked sufficient detail for advanced guideline-based scenarios. These findings highlight the potential and limitations of such automated models when applied in complex oncologic contexts. |
| format | Article |
| id | doaj-art-78551a1712484fcd8eb4c83191d43ef2 |
| institution | Kabale University |
| issn | 2730-6011 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Springer |
| record_format | Article |
| series | Discover Oncology |
| spelling | doaj-art-78551a1712484fcd8eb4c83191d43ef2; 2025-08-20T04:01:34Z; eng; Springer; Discover Oncology; 2730-6011; 2025-07-01; 161114; 10.1007/s12672-025-02911-7; Systematic benchmarking of large Language models in programmed cell death-oriented gastric cancer research: a comparative analysis of DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5; Yuheng Li (Department of General Surgery, Luzhou People’s Hospital); Jiaqi Dong (Department of Gastroenterology, Deyang People’s Hospital); Dongdong Liu (Department of Laboratory Medicine, Deyang People’s Hospital); Yuqing Huang (Department of Nephrology, Affiliated Hospital of Youjiang Medical University for Nationalities); Yan Jiang (Department of Nephrology, Affiliated Hospital of Youjiang Medical University for Nationalities); Liangchao Chen (Department of Oncology, Xichong People’s Hospital); Qiming Gong (Department of Nephrology, Affiliated Hospital of Youjiang Medical University for Nationalities); Abstract. Objectives: We intended to compare three language models (DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5) regarding their ability to address programmed cell death mechanisms in gastric cancer. We aimed to establish which model most accurately reflects clinical standards and guidelines. Methods: Fifty-five frequently posed questions and twenty guideline-oriented queries on cell death processes were collected from recognized gastroenterology and oncology resources. Each model received every question individually. Six independent specialists, each from a distinct hospital, rated responses from 1 to 10, and their scores were summed to a 60-point total. Answers achieving totals above 45 were classified as “good,” 30 to 45 as “moderate,” and below 30 as “poor.” Models that delivered “poor” replies received additional prompts for self‑correction, and revised answers underwent the same review. Results: DeepSeek‑R1 showed higher total scores than the other two models in almost every topic, particularly for surgical protocols and multimodal therapies. Claude 3.5 ranked second, displaying mostly coherent coverage but occasionally omitting recent guideline updates. DeepSeek‑V3 had difficulty with intricate guideline-based material. In “poor” responses, DeepSeek‑R1 corrected errors markedly, shifting to a “good” rating upon re-evaluation, while DeepSeek‑V3 improved only marginally. Claude 3.5 consistently moved its “poor” answers up into the moderate range. Conclusion: DeepSeek‑R1 demonstrated the strongest performance for clinical content linked to programmed cell death in gastric cancer, while Claude 3.5 performed moderately well. DeepSeek‑V3 proved adequate for more basic queries but lacked sufficient detail for advanced guideline-based scenarios. These findings highlight the potential and limitations of such automated models when applied in complex oncologic contexts.; https://doi.org/10.1007/s12672-025-02911-7; Gastric cancer; Programmed cell death; Apoptosis; Necroptosis; Model reliability |
| spellingShingle | Yuheng Li; Jiaqi Dong; Dongdong Liu; Yuqing Huang; Yan Jiang; Liangchao Chen; Qiming Gong; Systematic benchmarking of large Language models in programmed cell death-oriented gastric cancer research: a comparative analysis of DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5; Discover Oncology; Gastric cancer; Programmed cell death; Apoptosis; Necroptosis; Model reliability |
| title | Systematic benchmarking of large Language models in programmed cell death-oriented gastric cancer research: a comparative analysis of DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5 |
| title_sort | systematic benchmarking of large language models in programmed cell death oriented gastric cancer research a comparative analysis of deepseek v3 deepseek r1 and claude 3 5 |
| topic | Gastric cancer; Programmed cell death; Apoptosis; Necroptosis; Model reliability |
| url | https://doi.org/10.1007/s12672-025-02911-7 |
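
The scoring rule described in the abstract (six specialist ratings from 1 to 10, summed to a 60-point total, then banded as "good," "moderate," or "poor") can be illustrated with a minimal sketch. The function name `classify_response` and the input validation are assumptions made for illustration; the record does not describe how the authors actually computed or stored the scores.

```python
from typing import Sequence, Tuple


def classify_response(ratings: Sequence[int]) -> Tuple[int, str]:
    """Sum six 1-10 specialist ratings and band the total.

    Hypothetical helper: bands follow the abstract (above 45 is "good",
    30 to 45 is "moderate", below 30 is "poor").
    """
    if len(ratings) != 6:
        raise ValueError("expected exactly six ratings")
    if any(not 1 <= r <= 10 for r in ratings):
        raise ValueError("each rating must be between 1 and 10")

    total = sum(ratings)
    if total > 45:
        label = "good"
    elif total >= 30:
        label = "moderate"
    else:
        label = "poor"
    return total, label


# Example: a response rated 7 or 8 by each specialist.
total, label = classify_response([7, 8, 7, 8, 7, 8])
needs_self_correction = label == "poor"  # only "poor" replies were re-prompted
```

Under this reading, a total of exactly 45 or exactly 30 falls in the "moderate" band, since the abstract reserves "good" for totals above 45 and "poor" for totals below 30.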