MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Ensuring the general efficacy and benefit for human beings from medical Large Language Models (LLM) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLM, especially in the Chinese context, remains to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM. First, MedBench assembles the currently largest evaluation dataset (300901 questions) to cover 43 clinical specialties, and performs multi-faceted evaluation on medical LLM. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separations between question and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results largely aligning with medical professionals’ perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.


Bibliographic Details
Main Authors: Mianxin Liu, Weiguo Hu, Jinru Ding, Jie Xu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song, Pengfei Liu, Xiaofan Zhang, Shanshan Wang, Kang Li, Haofen Wang, Tong Ruan, Xuanjing Huang, Xin Sun, Shaoting Zhang
Format: Article
Language:English
Published: Tsinghua University Press 2024-12-01
Series:Big Data Mining and Analytics
Subjects:
Online Access:https://www.sciopen.com/article/10.26599/BDMA.2024.9020044
_version_ 1849685560176672768
author Mianxin Liu
Weiguo Hu
Jinru Ding
Jie Xu
Xiaoyang Li
Lifeng Zhu
Zhian Bai
Xiaoming Shi
Benyou Wang
Haitao Song
Pengfei Liu
Xiaofan Zhang
Shanshan Wang
Kang Li
Haofen Wang
Tong Ruan
Xuanjing Huang
Xin Sun
Shaoting Zhang
author_facet Mianxin Liu
Weiguo Hu
Jinru Ding
Jie Xu
Xiaoyang Li
Lifeng Zhu
Zhian Bai
Xiaoming Shi
Benyou Wang
Haitao Song
Pengfei Liu
Xiaofan Zhang
Shanshan Wang
Kang Li
Haofen Wang
Tong Ruan
Xuanjing Huang
Xin Sun
Shaoting Zhang
author_sort Mianxin Liu
collection DOAJ
description Ensuring the general efficacy and benefit for human beings from medical Large Language Models (LLM) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLM, especially in the Chinese context, remains to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM. First, MedBench assembles the currently largest evaluation dataset (300901 questions) to cover 43 clinical specialties, and performs multi-faceted evaluation on medical LLM. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separations between question and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results largely aligning with medical professionals’ perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.
format Article
id doaj-art-c4beb5096bd7443da21b01a4a4922d29
institution DOAJ
issn 2096-0654
language English
publishDate 2024-12-01
publisher Tsinghua University Press
record_format Article
series Big Data Mining and Analytics
spelling doaj-art-c4beb5096bd7443da21b01a4a4922d292025-08-20T03:23:06ZengTsinghua University PressBig Data Mining and Analytics2096-06542024-12-01741116112810.26599/BDMA.2024.9020044MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language ModelsMianxin Liu0Weiguo Hu1Jinru Ding2Jie Xu3Xiaoyang Li4Lifeng Zhu5Zhian Bai6Xiaoming Shi7Benyou Wang8Haitao Song9Pengfei Liu10Xiaofan Zhang11Shanshan Wang12Kang Li13Haofen Wang14Tong Ruan15Xuanjing Huang16Xin Sun17Shaoting Zhang18Shanghai Artificial Intelligence Laboratory, Shanghai 200232, ChinaRuijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, ChinaShanghai Artificial Intelligence Laboratory, Shanghai 200232, ChinaShanghai Artificial Intelligence Laboratory, Shanghai 200232, ChinaRuijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, ChinaRuijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, ChinaRuijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, ChinaShanghai Artificial Intelligence Laboratory, Shanghai 200232, ChinaChinese University of Hong Kong, Shenzhen 518172, ChinaShanghai Artificial Intelligence Research Institute, Shanghai 200240, and also with Shanghai Jiao Tong University, Shanghai 200240, ChinaSchool of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, ChinaQing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai 200240, ChinaShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, ChinaWest China Hospital, Sichuan University, Chengdu 610041, ChinaSchool of Design and Innovation, Tongji University, Shanghai 200092, ChinaDepartment of Computer Science and Technology, East China University of Science and Technology, Shanghai 200237, ChinaSchool of Computer Science, Fudan University, Shanghai 200433, ChinaXinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai 200092, ChinaShanghai Artificial Intelligence Laboratory, Shanghai 200232, ChinaEnsuring the general efficacy and benefit for human beings from medical Large Language Models (LLM) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLM, especially in the Chinese context, remains to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM. First, MedBench assembles the currently largest evaluation dataset (300901 questions) to cover 43 clinical specialties, and performs multi-faceted evaluation on medical LLM. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separations between question and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results largely aligning with medical professionals’ perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.https://www.sciopen.com/article/10.26599/BDMA.2024.9020044medical large language model (mllm)benchmarkplatformopen-source
spellingShingle Mianxin Liu
Weiguo Hu
Jinru Ding
Jie Xu
Xiaoyang Li
Lifeng Zhu
Zhian Bai
Xiaoming Shi
Benyou Wang
Haitao Song
Pengfei Liu
Xiaofan Zhang
Shanshan Wang
Kang Li
Haofen Wang
Tong Ruan
Xuanjing Huang
Xin Sun
Shaoting Zhang
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
Big Data Mining and Analytics
medical large language model (mllm)
benchmark
platform
open-source
title MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
title_full MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
title_fullStr MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
title_full_unstemmed MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
title_short MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
title_sort medbench a comprehensive standardized and reliable benchmarking system for evaluating chinese medical large language models
topic medical large language model (mllm)
benchmark
platform
open-source
url https://www.sciopen.com/article/10.26599/BDMA.2024.9020044
work_keys_str_mv AT mianxinliu medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT weiguohu medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT jinruding medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT jiexu medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT xiaoyangli medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT lifengzhu medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT zhianbai medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT xiaomingshi medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT benyouwang medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT haitaosong medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT pengfeiliu medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT xiaofanzhang medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT shanshanwang medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT kangli medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT haofenwang medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT tongruan medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT xuanjinghuang medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT xinsun medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels
AT shaotingzhang medbenchacomprehensivestandardizedandreliablebenchmarkingsystemforevaluatingchinesemedicallargelanguagemodels