Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text
Main Authors: Yan Zhuang, Junyan Zhang, Xiuxing Li, Chao Liu, Yue Yu, Wei Dong, Kunlun He
Format: Article
Language: English
Published: JMIR Publications, 2025-01-01
Series: JMIR Medical Informatics
Online Access: https://medinform.jmir.org/2025/1/e63020
_version_ | 1841557116550119424 |
author | Yan Zhuang; Junyan Zhang; Xiuxing Li; Chao Liu; Yue Yu; Wei Dong; Kunlun He
author_facet | Yan Zhuang; Junyan Zhang; Xiuxing Li; Chao Liu; Yue Yu; Wei Dong; Kunlun He
author_sort | Yan Zhuang |
collection | DOAJ |
description |
Background: Machine learning models can reduce the burden on doctors by converting medical records into International Classification of Diseases (ICD) codes in real time, thereby enhancing the efficiency of diagnosis and treatment. However, this task faces challenges such as small datasets, diverse writing styles, unstructured records, and the need for semimanual preprocessing. Existing approaches, such as naive Bayes, Word2Vec, and convolutional neural networks, have limitations in handling missing values and understanding the context of medical texts, leading to high error rates. We developed a fully automated pipeline based on the Key–bidirectional encoder representations from transformers (BERT) approach and large-scale medical records for continued pretraining, which effectively converts long free text into standard ICD codes. By adjusting parameter settings, such as mixed templates and soft verbalizers, the model can adapt flexibly to different requirements, enabling task-specific prompt learning.
Objective: This study aims to propose a real-time prompt learning framework based on pretrained language models that can automatically label long free-text data with ICD-10 codes for cardiovascular diseases without the need for semiautomatic preprocessing.
Methods: We integrated 4 components into our framework: a medical pretrained BERT, a keyword filtration BERT in a functional order, a fine-tuning phase, and task-specific prompt learning using mixed templates and soft verbalizers. This framework was validated on a multicenter medical dataset for the automated ICD coding of 13 common cardiovascular diseases (584,969 records). Its performance was compared against the robustly optimized BERT pretraining approach (RoBERTa), the extreme language network (XLNet), and various BERT-based fine-tuning pipelines. Additionally, we evaluated the framework’s performance under different prompt learning and fine-tuning settings. Furthermore, few-shot learning experiments were conducted to assess the feasibility and efficacy of our framework in scenarios involving small- to mid-sized datasets.
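The mixed-template, soft-verbalizer setup named in the Methods can be illustrated with a minimal, framework-free sketch (all class names, token choices, and dimensions below are illustrative assumptions, not the paper's implementation): a mixed template interleaves trainable "soft" slots with fixed text around a [MASK] position, and a soft verbalizer maps the masked-LM vocabulary scores at that position to label scores through a learned weight matrix rather than through fixed label words.

```python
import random

class MixedTemplate:
    """Interleaves fixed text tokens with trainable soft-prompt slots.
    In a real system the [SOFT] slots hold continuous embeddings tuned by
    backpropagation; here we only model the token layout around [MASK]."""
    def __init__(self, n_soft=2):
        self.layout = (["[SOFT]"] * n_soft
                       + ["record", ":", "[TEXT]", "diagnosis", "is", "[MASK]"])

    def wrap(self, text_tokens):
        out = []
        for tok in self.layout:
            out.extend(text_tokens if tok == "[TEXT]" else [tok])
        return out

class SoftVerbalizer:
    """Maps vocabulary logits at the [MASK] position to label scores via a
    learned (num_labels x vocab_size) weight matrix, instead of reading the
    logits of hand-picked label words (a 'hard' verbalizer)."""
    def __init__(self, num_labels, vocab_size, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)]
                  for _ in range(num_labels)]

    def __call__(self, mask_logits):
        return [sum(wi * x for wi, x in zip(row, mask_logits))
                for row in self.w]

template = MixedTemplate(n_soft=2)
prompt = template.wrap(["chest", "pain", "with", "ST", "elevation"])
verbalizer = SoftVerbalizer(num_labels=13, vocab_size=50)
scores = verbalizer([0.0] * 50)  # vocab logits would come from the masked LM
```

In training, both the soft-slot embeddings and the verbalizer weights would be updated alongside (or instead of) the BERT weights; prompt learning toolkits such as OpenPrompt expose mixed templates and soft verbalizers as ready-made components.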
Results: Compared with traditional pretraining and fine-tuning pipelines, our approach achieved a higher micro–F1-score of 0.838 and a macro–area under the receiver operating characteristic curve (macro-AUC) of 0.958, approximately 10% higher than the other methods. Among the prompt learning setups, the combination of mixed templates and soft verbalizers yielded the best performance. Few-shot experiments showed that performance stabilized and the AUC peaked at 500 shots.
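For reference, the two headline metrics can be computed as follows (a self-contained sketch with toy inputs; the numbers here are illustrative, not the paper's data): micro-F1 pools true positives, false positives, and false negatives across all labels before computing F1, while macro-AUC averages a per-label AUC, shown here via the rank-comparison (Mann–Whitney) formulation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data: pool TP/FP/FN over all labels."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)

def auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive outranks a random
    negative (Mann-Whitney formulation; ties count half)."""
    wins = sum((sp > sn) + 0.5 * (sp == sn)
               for sp in scores_pos for sn in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def macro_auc(per_label):
    """Unweighted mean of per-label AUCs, given (pos_scores, neg_scores) pairs."""
    return sum(auc(p, n) for p, n in per_label) / len(per_label)

# Toy example: 2 records x 3 labels.
f1 = micro_f1([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])  # 4/6 ≈ 0.667
```

Library implementations such as scikit-learn's `f1_score(..., average="micro")` and `roc_auc_score(..., average="macro")` compute the same quantities.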
Conclusions: These findings underscore the effectiveness and superior performance of prompt learning and fine-tuning for subtasks within pretrained language models in medical practice. Our real-time ICD coding pipeline efficiently converts detailed medical free text into standardized labels, offering promising applications in clinical decision-making. It can assist doctors unfamiliar with the ICD coding system in organizing medical record information, thereby accelerating the medical process and enhancing the efficiency of diagnosis and treatment. |
format | Article |
id | doaj-art-f7e33368b016434dae221c162908bf03 |
institution | Kabale University |
issn | 2291-9694 |
language | English |
publishDate | 2025-01-01 |
publisher | JMIR Publications |
record_format | Article |
series | JMIR Medical Informatics |
spelling | doaj-art-f7e33368b016434dae221c162908bf03 | 2025-01-06T18:00:34Z | eng | JMIR Publications | JMIR Medical Informatics | 2291-9694 | 2025-01-01 | volume 13 | e63020 | doi:10.2196/63020 | Authors (ORCID): Yan Zhuang (https://orcid.org/0000-0003-3483-0988); Junyan Zhang (https://orcid.org/0000-0001-9140-901X); Xiuxing Li (https://orcid.org/0000-0002-1178-7422); Chao Liu (https://orcid.org/0000-0002-8960-661X); Yue Yu (https://orcid.org/0009-0000-3697-1405); Wei Dong (https://orcid.org/0000-0003-4525-1105); Kunlun He (https://orcid.org/0000-0002-3335-5700) | https://medinform.jmir.org/2025/1/e63020 |
spellingShingle | Yan Zhuang; Junyan Zhang; Xiuxing Li; Chao Liu; Yue Yu; Wei Dong; Kunlun He; Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text; JMIR Medical Informatics |
title | Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text |
title_full | Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text |
title_fullStr | Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text |
title_full_unstemmed | Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text |
title_short | Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text |
title_sort | autonomous international classification of diseases coding using pretrained language models and advanced prompt learning techniques evaluation of an automated analysis system using medical text |
url | https://medinform.jmir.org/2025/1/e63020 |