MISTIC: a novel approach for metastasis classification in Italian electronic health records using transformers
Abstract. Background: Analysis of Electronic Health Records (EHRs) is crucial in real-world evidence (RWE), especially in oncology, as it provides valuable insights into the complex nature of the disease. The implementation of advanced techniques for automated extraction of structured information from...
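| Main Authors: | Livia Lilli, Mario Santoro, Valeria Masiello, Stefano Patarnello, Luca Tagliaferri, Fabio Marazzi, Nikola Dino Capocchiano |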
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-04-01 |
| Series: | BMC Medical Informatics and Decision Making |
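| Subjects: | Metastatic breast cancer; Natural language processing; Sentence transformer; Large language model; Few shot learning; Electronic health record |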
| Online Access: | https://doi.org/10.1186/s12911-025-02994-w |
| _version_ | 1850184711724335104 |
|---|---|
| author | Livia Lilli; Mario Santoro; Valeria Masiello; Stefano Patarnello; Luca Tagliaferri; Fabio Marazzi; Nikola Dino Capocchiano |
| author_facet | Livia Lilli; Mario Santoro; Valeria Masiello; Stefano Patarnello; Luca Tagliaferri; Fabio Marazzi; Nikola Dino Capocchiano |
| author_sort | Livia Lilli |
| collection | DOAJ |
| description | Abstract. Background: Analysis of Electronic Health Records (EHRs) is crucial in real-world evidence (RWE), especially in oncology, as it provides valuable insights into the complex nature of the disease. The implementation of advanced techniques for automated extraction of structured information from textual data potentially enables access to expert knowledge in highly specialized contexts. In this paper, we introduce MISTIC, a Natural Language Processing (NLP) approach to classify the presence or absence of metastasis in Italian EHRs, in the breast cancer domain. Methods: Our approach consists of a transformer-based framework designed for few-shot learning, requiring a small labelled dataset and minimal computational resources for training. The pipeline includes text segmentation to improve model processing and topic analysis to filter informative content, ensuring relevant input data for classification. Results: MISTIC was evaluated across multiple data sources, and compared to several benchmark methodologies, ranging from a pattern-matching system, composed of regex and semantic rules, to BERT-based models implemented in a zero-shot learning setup and Large Language Models (LLMs). The results demonstrate the generalization of our approach, achieving an F-Score above 87% on all the sources, and outperforming the other experiments, with an overall F-Score of 91.2%. Conclusions: MISTIC achieves high performance in the Italian metastasis classification task, outperforming rule-based systems, zero-shot BERT models, and LLMs. Its few-shot learning setup offers a computationally efficient alternative to large-scale models, while its segmentation and topic analysis steps enhance explainability by explicitly linking predictions to key textual elements. Furthermore, MISTIC demonstrates strong generalization across different data sources, reinforcing its potential as a scalable and transparent solution for clinical text classification. By extracting high-quality metastatic information from diverse textual data, MISTIC supports medical researchers in analyzing unstructured and highly informative content across a wide range of medical reports. In doing so, it enhances data accessibility and interpretability, addressing a critical gap in health informatics and clinical practice. |
| format | Article |
| id | doaj-art-100679fba74b418fa72962bfcdf668ba |
| institution | OA Journals |
| issn | 1472-6947 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Medical Informatics and Decision Making |
| spelling | doaj-art-100679fba74b418fa72962bfcdf668ba; 2025-08-20T02:16:59Z; eng; BMC; BMC Medical Informatics and Decision Making; 1472-6947; 2025-04-01; 251111; 10.1186/s12911-025-02994-w; MISTIC: a novel approach for metastasis classification in Italian electronic health records using transformers; Authors: Livia Lilli (0), Mario Santoro (1), Valeria Masiello (2), Stefano Patarnello (3), Luca Tagliaferri (4), Fabio Marazzi (5), Nikola Dino Capocchiano (6); Affiliations, in listed order: (0) Fondazione Policlinico Universitario Agostino Gemelli IRCCS, (1) Istituto per le Applicazioni del Calcolo “Mauro Picone”, Italian National Research Council, (2) Fondazione Policlinico Universitario Agostino Gemelli IRCCS, (3) Fondazione Policlinico Universitario Agostino Gemelli IRCCS, (4) Fondazione Policlinico Universitario Agostino Gemelli IRCCS, (5) Fondazione Policlinico Universitario Agostino Gemelli IRCCS, (6) Fondazione Policlinico Universitario Agostino Gemelli IRCCS; https://doi.org/10.1186/s12911-025-02994-w; Metastatic breast cancer; Natural language processing; Sentence transformer; Large language model; Few shot learning; Electronic health record |
| title | MISTIC: a novel approach for metastasis classification in Italian electronic health records using transformers |
| title_sort | mistic a novel approach for metastasis classification in italian electronic health records using transformers |
| topic | Metastatic breast cancer; Natural language processing; Sentence transformer; Large language model; Few shot learning; Electronic health record |
| url | https://doi.org/10.1186/s12911-025-02994-w |