Chinese Clinical Named Entity Recognition With Segmentation Synonym Sentence Synthesis Mechanism: Algorithm Development and Validation

Abstract Background: Clinical named entity recognition (CNER) is a fundamental task in natural language processing used to extract named entities from electronic medical record texts. In recent years, with the continuous development of machine learning, deep learning models have...

Bibliographic Details
Main Authors: Jian Tang, Zikun Huang, Hongzhen Xu, Hao Zhang, Hailing Huang, Minqiong Tang, Pengsheng Luo, Dong Qin
Format: Article
Language: English
Published: JMIR Publications 2024-11-01
Series: JMIR Medical Informatics
Online Access: https://medinform.jmir.org/2024/1/e60334
_version_ 1850056846965997568
collection DOAJ
description Abstract
Background: Clinical named entity recognition (CNER) is a fundamental task in natural language processing used to extract named entities from electronic medical record texts. In recent years, with the continuous development of machine learning, deep learning models have replaced traditional machine learning and template-based methods, becoming widely applied in the CNER field. However, due to the complexity of clinical texts, the diversity and large quantity of named entity types, and the unclear boundaries between different entities, existing advanced methods rely to some extent on annotated databases and the scale of embedded dictionaries.
Objective: This study aims to address the issues of data scarcity and labeling difficulties in CNER tasks by proposing a dataset augmentation algorithm based on proximity word calculation.
Methods: We propose a Segmentation Synonym Sentence Synthesis (SSSS) algorithm based on neighboring vocabulary, which leverages existing public knowledge without the need for manual expansion of specialized domain dictionaries. Through lexical segmentation, the algorithm replaces new synonymous vocabulary by recombining from vast natural language data, achieving nearby expansion expressions of the dataset. We applied the SSSS algorithm to the Robustly Optimized Bidirectional Encoder Representations from Transformers Pretraining Approach (RoBERTa) + conditional random field (CRF) and RoBERTa + Bidirectional Long Short-Term Memory (BiLSTM) + CRF models and evaluated our models (SSSS + RoBERTa + CRF; SSSS + RoBERTa + BiLSTM + CRF) on the China Conference on Knowledge Graph and Semantic Computing (CCKS) 2017 and 2019 datasets.
Results: Our experiments demonstrated that the models SSSS + RoBERTa + CRF and SSSS + RoBERTa + BiLSTM + CRF achieved F1 [...]
Conclusions: The experimental results indicated that our proposed method successfully expanded the dataset and remarkably improved the performance of the model, effectively addressing the challenges of data acquisition, annotation difficulties, and insufficient model generalization performance.
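The augmentation idea described in the Methods (segment a sentence, swap segments for neighboring or synonymous words, recombine into new sentences) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the synonym table `SYNONYMS`, the helper names `relabel` and `synthesize`, the entity tags, and the choice to realign character-level BIO labels after each substitution are all assumptions made for illustration, and the input is taken to be already segmented.

```python
from itertools import product

# Hypothetical toy synonym table. The paper draws neighboring words from
# public language resources; this sketch does not reproduce that step.
SYNONYMS = {
    "疼痛": ["痛感"],
    "明显": ["显著"],
}


def relabel(segment, tag):
    """Return character-level BIO labels for one segment.

    `tag` is "O" for non-entity text, otherwise an entity type name.
    """
    if tag == "O":
        return ["O"] * len(segment)
    return ["B-" + tag] + ["I-" + tag] * (len(segment) - 1)


def synthesize(segments, limit=10):
    """Build augmented (sentence, labels) pairs from a segmented sentence.

    `segments` is a list of (text, tag) pairs. Every combination of
    original and substitute segments is generated, except the unchanged
    original, up to `limit` variants.
    """
    options = [[text] + SYNONYMS.get(text, []) for text, _ in segments]
    variants = []
    for combo in product(*options):
        if all(word == text for word, (text, _) in zip(combo, segments)):
            continue  # skip the unchanged original sentence
        sentence = "".join(combo)
        labels = [
            label
            for word, (_, tag) in zip(combo, segments)
            for label in relabel(word, tag)
        ]
        variants.append((sentence, labels))
        if len(variants) >= limit:
            break
    return variants


# Assumed pre-segmented input with illustrative entity tags.
example = [("腹部", "BODY"), ("疼痛", "SYMPTOM"), ("明显", "O")]
variants = synthesize(example)
for sentence, labels in variants:
    print(sentence, labels)
```

Regenerating the labels per segment keeps each synthesized sentence aligned with its annotation even when a substitute word has a different length from the original, which is what lets the augmented sentences be used as additional training data for a sequence labeler.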
format Article
id doaj-art-9a3c236e66504fe69c797d41ceb3d8bf
institution DOAJ
issn 2291-9694
language English
publishDate 2024-11-01
publisher JMIR Publications
record_format Article
series JMIR Medical Informatics
spelling doaj-art-9a3c236e66504fe69c797d41ceb3d8bf 2025-08-20T02:51:35Z eng
JMIR Publications, JMIR Medical Informatics, ISSN 2291-9694, 2024-11-01, vol. 12, e60334, doi:10.2196/60334
Chinese Clinical Named Entity Recognition With Segmentation Synonym Sentence Synthesis Mechanism: Algorithm Development and Validation
Jian Tang http://orcid.org/0009-0008-2957-4530
Zikun Huang http://orcid.org/0009-0004-2431-1265
Hongzhen Xu http://orcid.org/0009-0008-4632-3389
Hao Zhang http://orcid.org/0009-0005-8893-1970
Hailing Huang http://orcid.org/0009-0008-0847-7889
Minqiong Tang http://orcid.org/0009-0003-3198-6140
Pengsheng Luo http://orcid.org/0009-0007-5369-4674
Dong Qin http://orcid.org/0009-0007-9264-2076
https://medinform.jmir.org/2024/1/e60334
url https://medinform.jmir.org/2024/1/e60334