Systematic benchmarking of large language models in programmed cell death-oriented gastric cancer research: a comparative analysis of DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5
| Main Authors: | , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-07-01 |
| Series: | Discover Oncology |
| Subjects: | |
| Online Access: | https://doi.org/10.1007/s12672-025-02911-7 |
| Summary: | Abstract Objectives We intended to compare three language models (DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5) regarding their ability to address programmed cell death mechanisms in gastric cancer. We aimed to establish which model most accurately reflects clinical standards and guidelines. Methods Fifty-five frequently posed questions and twenty guideline-oriented queries on cell death processes were collected from recognized gastroenterology and oncology resources. Each model received every question individually. Six independent specialists, each from a distinct hospital, rated responses from 1 to 10, and their scores were summed to a 60-point total. Answers with totals above 45 were classified as “good,” 30 to 45 as “moderate,” and below 30 as “poor.” Models that delivered “poor” replies received additional prompts for self‑correction, and the revised answers underwent the same review. Results DeepSeek‑R1 achieved higher total scores than the other two models in almost every topic, particularly for surgical protocols and multimodal therapies. Claude 3.5 ranked second, displaying mostly coherent coverage but occasionally omitting recent guideline updates. DeepSeek‑V3 had difficulty with intricate guideline-based material. For responses initially rated “poor,” DeepSeek‑R1 corrected its errors markedly, shifting to a “good” rating upon re-evaluation, while DeepSeek‑V3 improved only marginally. Claude 3.5 consistently moved its “poor” answers up into the moderate range. Conclusion DeepSeek‑R1 demonstrated the strongest performance for clinical content linked to programmed cell death in gastric cancer, while Claude 3.5 performed moderately well. DeepSeek‑V3 proved adequate for more basic queries but lacked sufficient detail for advanced guideline-based scenarios. These findings highlight the potential and limitations of such automated models when applied in complex oncologic contexts. |
| ISSN: | 2730-6011 |
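The scoring rule described in the Methods lends itself to a small illustration. The sketch below is a minimal Python rendering of that rubric, assuming only what the abstract states: six specialists each rate a response from 1 to 10, the ratings are summed to a 60-point total, and the total maps to “good” (above 45), “moderate” (30 to 45), or “poor” (below 30). The function name, validation, and example scores are illustrative, not taken from the study.

```python
def classify_response(expert_scores: list[int]) -> str:
    """Sum six 1-10 expert ratings and map the total to a quality label.

    Hypothetical helper illustrating the rubric in the abstract; the study
    itself only reports the thresholds, not any reference implementation.
    """
    if len(expert_scores) != 6 or not all(1 <= s <= 10 for s in expert_scores):
        raise ValueError("expected six scores, each between 1 and 10")

    total = sum(expert_scores)  # maximum possible total is 60

    if total > 45:
        return "good"
    if total >= 30:
        return "moderate"
    return "poor"


# Example: ratings of 8, 7, 9, 6, 8, 8 sum to 46, so the answer is "good".
print(classify_response([8, 7, 9, 6, 8, 8]))
```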