Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text

Bibliographic Details
Main Authors: Yan Zhuang, Junyan Zhang, Xiuxing Li, Chao Liu, Yue Yu, Wei Dong, Kunlun He
Format: Article
Language: English
Published: JMIR Publications 2025-01-01
Series:JMIR Medical Informatics
Online Access:https://medinform.jmir.org/2025/1/e63020
Collection: DOAJ
Description:
Background: Machine learning models can reduce the burden on doctors by converting medical records into International Classification of Diseases (ICD) codes in real time, thereby enhancing the efficiency of diagnosis and treatment. However, automated coding faces challenges such as small datasets, diverse writing styles, unstructured records, and the need for semimanual preprocessing. Existing approaches, such as naive Bayes, Word2Vec, and convolutional neural networks, have limitations in handling missing values and understanding the context of medical texts, leading to high error rates. We developed a fully automated pipeline based on the Key–bidirectional encoder representations from transformers (BERT) approach, using large-scale medical records for continued pretraining, which effectively converts long free text into standard ICD codes. By adjusting parameter settings such as mixed templates and soft verbalizers, the model adapts flexibly to different requirements, enabling task-specific prompt learning.

Objective: This study aims to propose a real-time prompt learning framework based on pretrained language models that can automatically label long free-text records with ICD-10 codes for cardiovascular diseases, without the need for semiautomatic preprocessing.

Methods: We integrated 4 components into our framework: a medical pretrained BERT, a keyword filtration BERT applied in functional order, a fine-tuning phase, and task-specific prompt learning using mixed templates and soft verbalizers. The framework was validated on a multicenter medical dataset of 584,969 records for the automated ICD coding of 13 common cardiovascular diseases. Its performance was compared against the robustly optimized BERT pretraining approach (RoBERTa), the extreme language network (XLNet), and various BERT-based fine-tuning pipelines. Additionally, we evaluated the framework's performance under different prompt learning and fine-tuning settings.
Furthermore, few-shot learning experiments were conducted to assess the feasibility and efficacy of our framework in scenarios involving small- to mid-sized datasets.

Results: Compared with traditional pretraining and fine-tuning pipelines, our approach achieved a higher micro–F1-score of 0.838 and a macro–area under the receiver operating characteristic curve (macro-AUC) of 0.958, approximately 10% higher than the other methods. Among the prompt learning setups, the combination of mixed templates and soft verbalizers yielded the best performance. Few-shot experiments showed that performance stabilized, and the AUC peaked, at 500 shots.

Conclusions: These findings underscore the effectiveness and superior performance of prompt learning and fine-tuning for subtasks within pretrained language models in medical practice. Our real-time ICD coding pipeline efficiently converts detailed medical free text into standardized labels, offering promising applications in clinical decision-making. It can assist doctors unfamiliar with the ICD coding system in organizing medical record information, thereby accelerating the medical process and enhancing the efficiency of diagnosis and treatment.
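To make the Methods concrete, the sketch below illustrates the two prompt learning ingredients the abstract names: a mixed template (fixed "hard" tokens interleaved with trainable "soft" slots wrapping the clinical note) and a verbalizer (a mapping from ICD classes to label words that the masked-language model can emit). This is not the authors' code; the template syntax loosely follows the OpenPrompt library's mixed-template format, and all template text, label words, and code choices here are invented examples.

```python
# A mixed template interleaves hard tokens with soft slots; at inference the
# pretrained language model fills the {"mask"} slot with a label word.
MIXED_TEMPLATE = '{"placeholder":"text_a"} {"soft":"The diagnosis is"} {"mask"}.'

def render_template(template: str, text_a: str) -> str:
    """Render a template for inspection. Soft slots are marked rather than
    expanded, since their trainable embeddings (not surface text) matter."""
    rendered = template.replace('{"placeholder":"text_a"}', text_a)
    rendered = rendered.replace(
        '{"soft":"The diagnosis is"}', '[SOFT: The diagnosis is]'
    )
    return rendered.replace('{"mask"}', '[MASK]')

# A verbalizer maps each ICD class to candidate label words. A *soft*
# verbalizer learns this projection as trainable vectors, but the
# class-to-words lookup structure is the same.
VERBALIZER = {
    "I21": ["infarction", "myocardial"],  # acute myocardial infarction
    "I48": ["fibrillation", "atrial"],    # atrial fibrillation
    "I50": ["failure", "heart"],          # heart failure
}

def codes_for_word(word: str) -> list[str]:
    """Map a predicted label word back to the ICD codes it verbalizes."""
    return [code for code, words in VERBALIZER.items() if word in words]
```

For example, `render_template(MIXED_TEMPLATE, "Patient with irregular heartbeat.")` produces the prompt the model would score, and a predicted label word such as "atrial" maps back to code I48 through the verbalizer.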
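The two headline metrics in the Results can be computed as follows. This is an illustrative sketch on toy multi-label data (rows are records, columns are ICD codes), not the paper's evaluation code or data.

```python
def micro_f1(y_true, y_pred):
    """Micro-F1 pools true/false positives and false negatives over every
    (record, code) cell before computing precision and recall."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_auc(y_true, scores):
    """Macro-AUC averages a one-vs-rest AUC over codes. Per-code AUC is the
    probability that a positive record outscores a negative one (ties half)."""
    n_codes = len(y_true[0])
    aucs = []
    for c in range(n_codes):
        pos = [s[c] for y, s in zip(y_true, scores) if y[c] == 1]
        neg = [s[c] for y, s in zip(y_true, scores) if y[c] == 0]
        if not pos or not neg:
            continue  # AUC is undefined without both classes for this code
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        aucs.append(wins / (len(pos) * len(neg)))
    return sum(aucs) / len(aucs)
```

Micro averaging weights every coded record equally (so frequent codes dominate), while the macro average over per-code AUCs weights each of the 13 cardiovascular codes equally, which is why the paper reports both.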
Institution: Kabale University
ISSN: 2291-9694
DOI: 10.2196/63020 (JMIR Medical Informatics, vol. 13, article e63020, 2025)
Author ORCIDs: Yan Zhuang (https://orcid.org/0000-0003-3483-0988), Junyan Zhang (https://orcid.org/0000-0001-9140-901X), Xiuxing Li (https://orcid.org/0000-0002-1178-7422), Chao Liu (https://orcid.org/0000-0002-8960-661X), Yue Yu (https://orcid.org/0009-0000-3697-1405), Wei Dong (https://orcid.org/0000-0003-4525-1105), Kunlun He (https://orcid.org/0000-0002-3335-5700)