Leveraging natural language processing for efficient information extraction from breast cancer pathology reports: Single-institution study.

Background: Pathology reports provide important information for accurate cancer diagnosis and optimal treatment decision making. In particular, breast cancer is known to be the most common cancer in women worldwide. Objective: For the data extraction of breas...

Bibliographic Details
Main Authors:	Phillip Park, Yeonho Choi, Nayoung Han, Ye-Lin Park, Juyeon Hwang, Heejung Chae, Chong Woo Yoo, Kui Son Choi, Hyun-Jin Kim
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2025-01-01
Series:	PLoS ONE
Online Access:	https://doi.org/10.1371/journal.pone.0318726

collection	DOAJ
description	Background: Pathology reports provide important information for accurate cancer diagnosis and optimal treatment decision making. In particular, breast cancer is known to be the most common cancer in women worldwide. Objective: We compared the accuracy of regular expressions and natural language processing (NLP) for extracting data from breast cancer pathology reports at a single institution. Methods: A total of 1,215 breast cancer pathology reports were annotated for NLP model development. As NLP models, we considered three BERT models with specific vocabularies: BERT-basic, BioBERT, and ClinicalBERT. K-fold cross-validation was used to verify the performance of the BERT models. The results of the regular expression method and the BERT model were compared using named entity recognition (NER) techniques. Results: Among the three BERT models, BioBERT was the most accurate parsing model for breast cancer pathology reports (average performance = 0.99901) with k = 5. BioBERT also had the lowest error rate for all items in the breast cancer pathology report compared with the other BERT models (accuracy for all variables ≥ 0.9). We therefore selected BioBERT as the NLP model. When comparing the results of BioBERT and regular expressions using NER, BioBERT was more accurate than the regular expression method, especially for items such as intraductal component (BioBERT: 1.0, RegEx: 0.1644), lymph node (BioBERT: 0.9886, RegEx: 0.4792), and lymphovascular invasion (BioBERT: 0.9918, RegEx: 0.3759). Conclusions: The NLP model, BioBERT, had higher accuracy than regular expressions, suggesting the importance of BioBERT in the processing of breast cancer pathology reports.
id	doaj-art-e78be3a4002b4a74976f32debea09841
institution	Kabale University
issn	1932-6203
spelling	doaj-art-e78be3a4002b4a74976f32debea09841; 2025-08-20T03:49:03Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2025-01-01; 20; 2; e0318726; 10.1371/journal.pone.0318726

Similar Items