A hybrid rule-based NLP and machine learning approach for PII detection and anonymization in financial documents
Abstract Safeguarding Personally Identifiable Information (PII) in financial documents is essential to prevent data breaches and maintain regulatory compliance. This research presents a scalable hybrid approach that integrates rule-based Natural Language Processing (NLP), Machine Learning (ML) appro...
| Main Authors: | Kushagra Mishra, Harsh Pagare, Kanhaiya Sharma |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
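| Subjects: | Machine learning; Natural language processing; Personally identifiable information; Data anonymization; Financial data security |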
| Online Access: | https://doi.org/10.1038/s41598-025-04971-9 |
| _version_ | 1849335019034640384 |
|---|---|
| author | Kushagra Mishra; Harsh Pagare; Kanhaiya Sharma |
| collection | DOAJ |
| description | Abstract: Safeguarding Personally Identifiable Information (PII) in financial documents is essential to prevent data breaches and maintain regulatory compliance. This research presents a scalable hybrid approach that integrates rule-based Natural Language Processing (NLP), Machine Learning (ML) techniques, and a custom Named Entity Recognition (NER) model to detect and anonymize PII accurately. A varied and accurate synthetic dataset was created to replicate genuine financial document formats for model training and assessment. The model attained a precision of 94.7%, a recall of 89.4%, an F1-score of 91.1%, and an overall accuracy of 89.4% on synthetic datasets. Additional validation on real financial documents, such as audit reports and vendor bills, showed consistent performance, with an accuracy of 93%. The study evaluates the model using confusion matrices, ROC curves, and precision-recall curves, which further support its capabilities and generalization ability. The proposed approach provides a robust and efficient solution for protecting sensitive information in operational financial contexts, markedly enhancing current methods for PII protection. |
| format | Article |
| id | doaj-art-aa584f1d9a534cd38efbb41df3ed7dc8 |
| institution | Kabale University |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-aa584f1d9a534cd38efbb41df3ed7dc8; 2025-08-20T03:45:25Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-07-01; 151127; 10.1038/s41598-025-04971-9; A hybrid rule-based NLP and machine learning approach for PII detection and anonymization in financial documents; Kushagra Mishra 0; Harsh Pagare 1; Kanhaiya Sharma 2; Department of Computer Science & Engineering, Symbiosis Institute of Technology, Constituent of Symbiosis International (Deemed University); Department of Computer Science & Engineering, Symbiosis Institute of Technology, Constituent of Symbiosis International (Deemed University); Department of Computer Science & Engineering, Symbiosis Institute of Technology, Constituent of Symbiosis International (Deemed University); Abstract Safeguarding Personally Identifiable Information (PII) in financial documents is essential to prevent data breaches and maintain regulatory compliance. This research presents a scalable hybrid approach that integrates rule-based Natural Language Processing (NLP), Machine Learning (ML) approaches, and a custom Named Entity Recognition (NER) model for the accurate detection and anonymization of Personally Identifiable Information (PII). A varied and accurate synthetic dataset was created to replicate genuine financial document formats, enhancing model training and assessment. The model has attained a precision of 94.7%, a recall of 89.4%, an F1-score of 91.1%, and an overall accuracy of 89.4% on synthetic datasets. Additional validation on actual financial documents, such as audit reports and vendor bills, revealed a consistent performance with an accuracy of 93%. The study utilizes confusion matrices, ROC curves, and precision-recall curves to evaluate the model which further validates the model’s capabilities and generalization ability. The suggested approach provides a robust and efficient solution for protecting sensitive information in operational financial contexts, markedly enhancing current methods for PII protection.; https://doi.org/10.1038/s41598-025-04971-9; Machine learning; Natural language processing; Personally identifiable information; Data anonymization; Financial data security |
| title | A hybrid rule-based NLP and machine learning approach for PII detection and anonymization in financial documents |
| topic | Machine learning; Natural language processing; Personally identifiable information; Data anonymization; Financial data security |
| url | https://doi.org/10.1038/s41598-025-04971-9 |
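
The abstract in this record describes combining rule-based detection with a learned NER model to find and anonymize PII. As a rough illustration of the rule-based half only, the following minimal Python sketch matches e-mail addresses and payment-card-like numbers with regular expressions and replaces them with placeholders. It is not the authors' implementation: the pattern set `PII_PATTERNS`, the helper names `luhn_valid`, `detect_pii`, and `anonymize`, and the `[LABEL]` placeholder format are all assumptions made for this example, and the NER component is omitted entirely.

```python
import re
from typing import List, NamedTuple

# Hypothetical rule set (an assumption for illustration, not the paper's rules).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # 13 to 19 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}


class Span(NamedTuple):
    label: str
    start: int
    end: int


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_pii(text: str) -> List[Span]:
    """Collect rule matches; card candidates must also pass the Luhn check."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            if label == "CARD" and not luhn_valid(m.group()):
                continue  # long digit run that is not a plausible card number
            spans.append(Span(label, m.start(), m.end()))
    return sorted(spans, key=lambda s: s.start)


def anonymize(text: str) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    Spans are replaced right to left so earlier offsets stay valid.
    The sketch assumes the rules do not produce overlapping spans.
    """
    out = text
    for span in reversed(detect_pii(text)):
        out = out[:span.start] + f"[{span.label}]" + out[span.end:]
    return out


if __name__ == "__main__":
    sample = "Invoice 1042: pay jane.doe@example.com, card 4111 1111 1111 1111."
    print(anonymize(sample))
```

The checksum step shows why rules are often paired with validation: a bare digit pattern would also flag reference numbers, while the Luhn check filters out many of them. In a hybrid setup of the kind the abstract describes, spans from a trained NER model would presumably be merged with these rule matches before masking; how the paper does this is not stated in this record.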