Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text

Bibliographic Details
Main Authors: Yan Zhuang, Junyan Zhang, Xiuxing Li, Chao Liu, Yue Yu, Wei Dong, Kunlun He
Format: Article
Language: English
Published: JMIR Publications 2025-01-01
Series:JMIR Medical Informatics
Online Access:https://medinform.jmir.org/2025/1/e63020
Collection: DOAJ
Description:
Background: Machine learning models can reduce the burden on doctors by converting medical records into International Classification of Diseases (ICD) codes in real time, thereby enhancing the efficiency of diagnosis and treatment. However, automated coding faces challenges such as small datasets, diverse writing styles, unstructured records, and the need for semimanual preprocessing. Existing approaches, such as naive Bayes, Word2Vec, and convolutional neural networks, have limitations in handling missing values and understanding the context of medical texts, leading to high error rates. We developed a fully automated pipeline based on the Key–bidirectional encoder representations from transformers (BERT) approach, using large-scale medical records for continued pretraining, which effectively converts long free text into standard ICD codes. By adjusting parameter settings such as mixed templates and soft verbalizers, the model adapts flexibly to different requirements, enabling task-specific prompt learning.

Objective: This study aims to propose a real-time prompt learning framework based on pretrained language models that can automatically label long free-text records with ICD-10 codes for cardiovascular diseases, without the need for semiautomatic preprocessing.

Methods: We integrated 4 components into our framework: a medical pretrained BERT, a keyword filtration BERT applied in functional order, a fine-tuning phase, and task-specific prompt learning using mixed templates and soft verbalizers. The framework was validated on a multicenter medical dataset of 584,969 records for the automated ICD coding of 13 common cardiovascular diseases. Its performance was compared against the robustly optimized BERT pretraining approach (RoBERTa), the extreme language network (XLNet), and various BERT-based fine-tuning pipelines. Additionally, we evaluated the framework's performance under different prompt learning and fine-tuning settings.
Furthermore, few-shot learning experiments were conducted to assess the feasibility and efficacy of our framework in scenarios involving small- to mid-sized datasets.

Results: Compared with traditional pretraining and fine-tuning pipelines, our approach achieved a higher micro–F1-score of 0.838 and a macro–area under the receiver operating characteristic curve (macro-AUC) of 0.958, approximately 10% higher than the other methods. Among the prompt learning setups, the combination of mixed templates and soft verbalizers yielded the best performance. Few-shot experiments showed that performance stabilized, and the AUC peaked, at 500 shots.

Conclusions: These findings underscore the effectiveness and superior performance of prompt learning and fine-tuning for subtasks within pretrained language models in medical practice. Our real-time ICD coding pipeline efficiently converts detailed medical free text into standardized labels, offering promising applications in clinical decision-making. It can assist doctors unfamiliar with the ICD coding system in organizing medical record information, thereby accelerating the medical process and enhancing the efficiency of diagnosis and treatment.
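To make the Methods concrete, the sketch below illustrates the two prompt learning ingredients the abstract names: a mixed template (fixed "hard" tokens interleaved with trainable "soft" slots wrapping the clinical note) and a verbalizer (a mapping from ICD classes to label words that the masked-language model can emit). This is not the authors' code; the template syntax loosely follows the OpenPrompt library's mixed-template format, and all template text, label words, and code choices here are invented examples.

```python
# A mixed template interleaves hard tokens with soft slots; at inference the
# pretrained language model fills the {"mask"} slot with a label word.
MIXED_TEMPLATE = '{"placeholder":"text_a"} {"soft":"The diagnosis is"} {"mask"}.'

def render_template(template: str, text_a: str) -> str:
    """Render a template for inspection. Soft slots are marked rather than
    expanded, since their trainable embeddings (not surface text) matter."""
    rendered = template.replace('{"placeholder":"text_a"}', text_a)
    rendered = rendered.replace(
        '{"soft":"The diagnosis is"}', '[SOFT: The diagnosis is]'
    )
    return rendered.replace('{"mask"}', '[MASK]')

# A verbalizer maps each ICD class to candidate label words. A *soft*
# verbalizer learns this projection as trainable vectors, but the
# class-to-words lookup structure is the same.
VERBALIZER = {
    "I21": ["infarction", "myocardial"],  # acute myocardial infarction
    "I48": ["fibrillation", "atrial"],    # atrial fibrillation
    "I50": ["failure", "heart"],          # heart failure
}

def codes_for_word(word: str) -> list[str]:
    """Map a predicted label word back to the ICD codes it verbalizes."""
    return [code for code, words in VERBALIZER.items() if word in words]
```

For example, `render_template(MIXED_TEMPLATE, "Patient with irregular heartbeat.")` produces the prompt the model would score, and a predicted label word such as "atrial" maps back to code I48 through the verbalizer.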
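The two headline metrics in the Results can be computed as follows. This is an illustrative sketch on toy multi-label data (rows are records, columns are ICD codes), not the paper's evaluation code or data.

```python
def micro_f1(y_true, y_pred):
    """Micro-F1 pools true/false positives and false negatives over every
    (record, code) cell before computing precision and recall."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_auc(y_true, scores):
    """Macro-AUC averages a one-vs-rest AUC over codes. Per-code AUC is the
    probability that a positive record outscores a negative one (ties half)."""
    n_codes = len(y_true[0])
    aucs = []
    for c in range(n_codes):
        pos = [s[c] for y, s in zip(y_true, scores) if y[c] == 1]
        neg = [s[c] for y, s in zip(y_true, scores) if y[c] == 0]
        if not pos or not neg:
            continue  # AUC is undefined without both classes for this code
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        aucs.append(wins / (len(pos) * len(neg)))
    return sum(aucs) / len(aucs)
```

Micro averaging weights every coded record equally (so frequent codes dominate), while the macro average over per-code AUCs weights each of the 13 cardiovascular codes equally, which is why the paper reports both.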
Institution: Kabale University
ISSN: 2291-9694
DOI: 10.2196/63020 (JMIR Medical Informatics, vol. 13, article e63020, 2025)
Author ORCIDs: Yan Zhuang (https://orcid.org/0000-0003-3483-0988), Junyan Zhang (https://orcid.org/0000-0001-9140-901X), Xiuxing Li (https://orcid.org/0000-0002-1178-7422), Chao Liu (https://orcid.org/0000-0002-8960-661X), Yue Yu (https://orcid.org/0009-0000-3697-1405), Wei Dong (https://orcid.org/0000-0003-4525-1105), Kunlun He (https://orcid.org/0000-0002-3335-5700)