A hybrid rule-based NLP and machine learning approach for PII detection and anonymization in financial documents

Abstract: Safeguarding Personally Identifiable Information (PII) in financial documents is essential to prevent data breaches and maintain regulatory compliance. This research presents a scalable hybrid approach that integrates rule-based Natural Language Processing (NLP), Machine Learning (ML) appro...

Bibliographic Details
Main Authors: Kushagra Mishra, Harsh Pagare, Kanhaiya Sharma
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
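Series: Scientific Reports
Subjects: Machine learning; Natural language processing; Personally identifiable information; Data anonymization; Financial data security
Online Access: https://doi.org/10.1038/s41598-025-04971-9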
author Kushagra Mishra
Harsh Pagare
Kanhaiya Sharma
collection DOAJ
description Abstract: Safeguarding Personally Identifiable Information (PII) in financial documents is essential to prevent data breaches and maintain regulatory compliance. This research presents a scalable hybrid approach that integrates rule-based Natural Language Processing (NLP), Machine Learning (ML) approaches, and a custom Named Entity Recognition (NER) model for the accurate detection and anonymization of PII. A varied and accurate synthetic dataset was created to replicate genuine financial document formats, supporting model training and assessment. The model attained a precision of 94.7%, a recall of 89.4%, an F1-score of 91.1%, and an overall accuracy of 89.4% on synthetic datasets. Additional validation on actual financial documents, such as audit reports and vendor bills, showed consistent performance, with an accuracy of 93%. The study uses confusion matrices, ROC curves, and precision-recall curves to evaluate the model, which further supports its capabilities and generalization ability. The suggested approach provides a robust and efficient solution for protecting sensitive information in operational financial contexts, markedly enhancing current methods for PII protection.
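The abstract names the ingredients of the pipeline (rules, ML, a custom NER model) but not how they are combined. As a rough illustration only, the sketch below shows one common way such a hybrid can be wired together: regular-expression rules and a pluggable NER function each propose labeled spans, overlapping spans are resolved, and the surviving spans are replaced with placeholder tags. The patterns, the labels, and the names `Span`, `RULES`, `rule_spans`, `merge`, and `anonymize` are all assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Callable, Iterable, List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


# Illustrative patterns only (assumed for this sketch, not the paper's rules).
RULES = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ACCOUNT": re.compile(r"\b\d{10,16}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Collect every regex match as a labeled span."""
    return [
        Span(m.start(), m.end(), label)
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge(spans: Iterable[Span]) -> List[Span]:
    """Resolve overlaps: scan left to right (longest first on ties) and
    drop any span that starts inside a span already kept."""
    kept: List[Span] = []
    for s in sorted(spans, key=lambda s: (s.start, -(s.end - s.start))):
        if not kept or s.start >= kept[-1].end:
            kept.append(s)
    return kept


def anonymize(text: str, ner: Callable[[str], Iterable[Span]] = lambda t: []) -> str:
    """Replace each detected span with a [LABEL] placeholder.
    `ner` stands in for a trained NER model; by default it finds nothing."""
    spans = merge(rule_spans(text) + list(ner(text)))
    out, cursor = [], 0
    for s in spans:
        out.append(text[cursor:s.start])
        out.append(f"[{s.label}]")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)


# Usage with a hard-coded stand-in for the NER model.
def fake_ner(t: str) -> List[Span]:
    i = t.index("John Doe")
    return [Span(i, i + len("John Doe"), "PERSON")]


sample = "Pay John Doe at john.doe@example.com or 555-123-4567, acct 1234567890123."
print(anonymize(sample, fake_ner))
# prints: Pay [PERSON] at [EMAIL] or [PHONE], acct [ACCOUNT].
```

In a real system the `ner` argument would wrap a trained model, and the overlap policy (here: earliest start, then longest match) is a design choice that would need tuning against labeled data.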
format Article
id doaj-art-aa584f1d9a534cd38efbb41df3ed7dc8
institution Kabale University
issn 2045-2322
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-aa584f1d9a534cd38efbb41df3ed7dc8 | 2025-08-20T03:45:25Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2025-07-01 | 15 | 1 | 1 | 27 | 10.1038/s41598-025-04971-9 | Kushagra Mishra; Harsh Pagare; Kanhaiya Sharma (each: Department of Computer Science & Engineering, Symbiosis Institute of Technology, Constituent of Symbiosis International (Deemed University)) | https://doi.org/10.1038/s41598-025-04971-9
title A hybrid rule-based NLP and machine learning approach for PII detection and anonymization in financial documents
topic Machine learning
Natural language processing
Personally identifiable information
Data anonymization
Financial data security
url https://doi.org/10.1038/s41598-025-04971-9