A hybrid rule-based NLP and machine learning approach for PII detection and anonymization in financial documents

Abstract Safeguarding Personally Identifiable Information (PII) in financial documents is essential to prevent data breaches and maintain regulatory compliance. This research presents a scalable hybrid approach that integrates rule-based Natural Language Processing (NLP), Machine Learning (ML) appro...

Bibliographic Details
Main Authors:	Kushagra Mishra, Harsh Pagare, Kanhaiya Sharma
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-07-01
Series:	Scientific Reports
Subjects:	Machine learning; Natural language processing; Personally identifiable information; Data anonymization; Financial data security
Online Access:	https://doi.org/10.1038/s41598-025-04971-9

_version_	1849335019034640384
author	Kushagra Mishra, Harsh Pagare, Kanhaiya Sharma
collection	DOAJ
description	Abstract: Safeguarding Personally Identifiable Information (PII) in financial documents is essential to prevent data breaches and maintain regulatory compliance. This research presents a scalable hybrid approach that integrates rule-based Natural Language Processing (NLP), Machine Learning (ML), and a custom Named Entity Recognition (NER) model for the accurate detection and anonymization of PII. A varied and accurate synthetic dataset was created to replicate genuine financial document formats for model training and assessment. On synthetic datasets, the model attained a precision of 94.7%, a recall of 89.4%, an F1-score of 91.1%, and an overall accuracy of 89.4%. Additional validation on actual financial documents, such as audit reports and vendor bills, showed consistent performance, with an accuracy of 93%. The study uses confusion matrices, ROC curves, and precision-recall curves to evaluate the model's capabilities and generalization ability. The suggested approach provides a robust and efficient solution for protecting sensitive information in operational financial contexts, markedly enhancing current methods for PII protection.
format	Article
id	doaj-art-aa584f1d9a534cd38efbb41df3ed7dc8
institution	Kabale University
issn	2045-2322
language	English
publishDate	2025-07-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj-art-aa584f1d9a534cd38efbb41df3ed7dc8 | 2025-08-20T03:45:25Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2025-07-01 | 151127 | 10.1038/s41598-025-04971-9 | A hybrid rule-based NLP and machine learning approach for PII detection and anonymization in financial documents | Kushagra Mishra (0), Harsh Pagare (1), Kanhaiya Sharma (2) | Affiliation (listed for each of the three authors): Department of Computer Science & Engineering, Symbiosis Institute of Technology, Constituent of Symbiosis International (Deemed University) | https://doi.org/10.1038/s41598-025-04971-9 | Machine learning; Natural language processing; Personally identifiable information; Data anonymization; Financial data security
title	A hybrid rule-based NLP and machine learning approach for PII detection and anonymization in financial documents
topic	Machine learning; Natural language processing; Personally identifiable information; Data anonymization; Financial data security
url	https://doi.org/10.1038/s41598-025-04971-9

Similar Items