MISTIC: a novel approach for metastasis classification in Italian electronic health records using transformers

Abstract Background Analysis of Electronic Health Records (EHRs) is crucial in real-world evidence (RWE), especially in oncology, as it provides valuable insights into the complex nature of the disease. The implementation of advanced techniques for automated extraction of structured information from...

Bibliographic Details
Main Authors: Livia Lilli, Mario Santoro, Valeria Masiello, Stefano Patarnello, Luca Tagliaferri, Fabio Marazzi, Nikola Dino Capocchiano
Format: Article
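Language: English
Published: BMC 2025-04-01
Series: BMC Medical Informatics and Decision Making
Subjects: Metastatic breast cancer; Natural language processing; Sentence transformer; Large language model; Few shot learning; Electronic health record
Online Access: https://doi.org/10.1186/s12911-025-02994-w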
collection DOAJ
description Abstract
Background: Analysis of Electronic Health Records (EHRs) is crucial in real-world evidence (RWE), especially in oncology, as it provides valuable insights into the complex nature of the disease. The implementation of advanced techniques for automated extraction of structured information from textual data potentially enables access to expert knowledge in highly specialized contexts. In this paper, we introduce MISTIC, a Natural Language Processing (NLP) approach to classify the presence or absence of metastasis in Italian EHRs, in the breast cancer domain.
Methods: Our approach consists of a transformer-based framework designed for few-shot learning, requiring a small labelled dataset and minimal computational resources for training. The pipeline includes text segmentation to improve model processing and topic analysis to filter informative content, ensuring relevant input data for classification.
Results: MISTIC was evaluated across multiple data sources, and compared to several benchmark methodologies, ranging from a pattern-matching system, composed of regex and semantic rules, to BERT-based models implemented in a zero-shot learning setup and Large Language Models (LLMs). The results demonstrate the generalization of our approach, achieving an F-Score above 87% on all the sources, and outperforming the other experiments, with an overall F-Score of 91.2%.
Conclusions: MISTIC achieves high performance in the Italian metastasis classification task, outperforming rule-based systems, zero-shot BERT models, and LLMs. Its few-shot learning setup offers a computationally efficient alternative to large-scale models, while its segmentation and topic analysis steps enhance explainability by explicitly linking predictions to key textual elements. Furthermore, MISTIC demonstrates strong generalization across different data sources, reinforcing its potential as a scalable and transparent solution for clinical text classification. By extracting high-quality metastatic information from diverse textual data, MISTIC supports medical researchers in analyzing unstructured and highly informative content across a wide range of medical reports. In doing so, it enhances data accessibility and interpretability, addressing a critical gap in health informatics and clinical practice.
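The three stages named in the Methods paragraph (segmentation, topic filtering, few-shot classification) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the topic terms, the example sentences, and every function name are hypothetical, and a bag-of-words vector with cosine similarity stands in for the sentence-transformer embeddings that the record mentions but does not specify.

```python
import math
import re
from collections import Counter

# Hypothetical topic vocabulary: keeps only segments that mention metastasis.
TOPIC_TERMS = ("metastasi", "secondarism")

def segment(report: str) -> list[str]:
    """Split a report into lowercase sentence-like segments."""
    parts = re.split(r"[.;\n]+", report)
    return [p.strip().lower() for p in parts if p.strip()]

def is_on_topic(seg: str) -> bool:
    """Topic filter: true if the segment contains any topic term."""
    return any(term in seg for term in TOPIC_TERMS)

def embed(seg: str) -> Counter:
    """Stand-in embedding (assumption): word counts, not a transformer."""
    return Counter(re.findall(r"\w+", seg.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(examples: list[str]) -> Counter:
    """Sum the embeddings of the few labelled examples of one class."""
    total = Counter()
    for example in examples:
        total.update(embed(example))
    return total

def classify(report: str, positive: list[str], negative: list[str]):
    """Return (metastasis_present, supporting_segments) for one report."""
    pos, neg = centroid(positive), centroid(negative)
    evidence = []
    for seg in filter(is_on_topic, segment(report)):
        vec = embed(seg)
        if cosine(vec, pos) > cosine(vec, neg):
            evidence.append(seg)
    return bool(evidence), evidence

# A handful of invented labelled sentences (the "few shots").
POSITIVE = ["presenza di metastasi ossee", "comparsa di metastasi epatiche multiple"]
NEGATIVE = ["assenza di metastasi a distanza", "non evidenza di metastasi"]

report = "Paziente in buone condizioni. TC torace: comparsa di metastasi polmonari."
label, evidence = classify(report, POSITIVE, NEGATIVE)
print(label, evidence)
```

Returning the supporting segments alongside the label mirrors the explainability point made in the Conclusions: each prediction can be traced back to the text that triggered it.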
id doaj-art-100679fba74b418fa72962bfcdf668ba
institution OA Journals
issn 1472-6947
spelling doaj-art-100679fba74b418fa72962bfcdf668ba
2025-08-20T02:16:59Z
eng
BMC
BMC Medical Informatics and Decision Making
1472-6947
2025-04-01
Volume 25, issue 1, pages 1-11
10.1186/s12911-025-02994-w
MISTIC: a novel approach for metastasis classification in Italian electronic health records using transformers
Livia Lilli (0): Fondazione Policlinico Universitario Agostino Gemelli IRCCS
Mario Santoro (1): Istituto per le Applicazioni del Calcolo “Mauro Picone”, Italian National Research Council
Valeria Masiello (2): Fondazione Policlinico Universitario Agostino Gemelli IRCCS
Stefano Patarnello (3): Fondazione Policlinico Universitario Agostino Gemelli IRCCS
Luca Tagliaferri (4): Fondazione Policlinico Universitario Agostino Gemelli IRCCS
Fabio Marazzi (5): Fondazione Policlinico Universitario Agostino Gemelli IRCCS
Nikola Dino Capocchiano (6): Fondazione Policlinico Universitario Agostino Gemelli IRCCS
https://doi.org/10.1186/s12911-025-02994-w
Metastatic breast cancer; Natural language processing; Sentence transformer; Large language model; Few shot learning; Electronic health record
topic Metastatic breast cancer
Natural language processing
Sentence transformer
Large language model
Few shot learning
Electronic health record