A novel framework for Chinese personal sensitive information detection

Bibliographic Details
Main Authors: Chenglong Ren, Xiao Lan, Xingshu Chen, Yonggang Luo, Shuhua Ruan
Format: Article
Language: English
Published: Taylor & Francis Group 2024-12-01
Series: Connection Science
Subjects: Chinese; personal sensitive information; rule matching; sequence labeling; context analysis
Online Access: https://www.tandfonline.com/doi/10.1080/09540091.2023.2298310
author Chenglong Ren
Xiao Lan
Xingshu Chen
Yonggang Luo
Shuhua Ruan
collection DOAJ
description With the rapid development of social networks, the harm caused by leaks of personal sensitive information is becoming increasingly serious. To detect and identify such information, existing methods build matching rules for specific sensitive entities and use machine learning to classify sensitive text. These methods struggle with context analysis and with adapting to the characteristics of the Chinese language. This paper proposes CPSID, a method for detecting Chinese personal sensitive information. On the one hand, CPSID utilises rule matching to detect specific personal sensitive information that contains only letters and numbers. On the other hand, and more importantly, CPSID constructs a sequence labelling model named EBC (ELECTRA-BiLSTM-CRF) to detect more complex personal sensitive information that consists of Chinese characters. The EBC model uses the latest ELECTRA algorithm to produce word embeddings and uses BiLSTM and CRF models to extract personal sensitive information, so it can detect Chinese sensitive entities accurately by analysing context information. The model achieves an F1 score of 94.09% on Chinese datasets, outperforming other similar models. Additionally, experiments on real data show that CPSID achieves better detection results than either method alone (rule matching or sequence labelling).
format Article
id doaj-art-52ee45cd3db94b119c38861ac6930f8b
institution DOAJ
issn 0954-0091
1360-0494
language English
publishDate 2024-12-01
publisher Taylor & Francis Group
record_format Article
series Connection Science
spelling doaj-art-52ee45cd3db94b119c38861ac6930f8b; 2025-08-20T03:18:05Z; eng; Taylor & Francis Group; Connection Science; ISSN 0954-0091, 1360-0494; 2024-12-01; vol. 36, no. 1; doi:10.1080/09540091.2023.2298310
A novel framework for Chinese personal sensitive information detection
Chenglong Ren (School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China)
Xiao Lan (Cyber Science Research Institute, Sichuan University, Chengdu, People’s Republic of China)
Xingshu Chen (School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China)
Yonggang Luo (Cyber Science Research Institute, Sichuan University, Chengdu, People’s Republic of China)
Shuhua Ruan (School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China)
https://www.tandfonline.com/doi/10.1080/09540091.2023.2298310
Chinese; personal sensitive information; rule matching; sequence labeling; context analysis
title A novel framework for Chinese personal sensitive information detection
topic Chinese
personal sensitive information
rule matching
sequence labeling
context analysis
url https://www.tandfonline.com/doi/10.1080/09540091.2023.2298310
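
The description above says CPSID combines rule matching (for sensitive items made only of letters and numbers) with a sequence labelling model (for items written in Chinese characters). The following is a minimal Python sketch of how two such components could be combined; it is not the authors' implementation. The regular expressions, the `NAME` label, and the function names `rule_match`, `decode_bio`, and `detect` are all assumptions made for illustration, since this record does not list the paper's actual rules or tag set. The BIO tags that the EBC model would predict are supplied by hand here rather than by a trained model.

```python
import re

# Hypothetical rules for letter-and-number items; the paper's real rules
# are not given in this record.
RULES = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "MOBILE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "ID_CARD": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def id_checksum_ok(number: str) -> bool:
    """Check the last character of an 18-character resident ID number
    using the ISO 7064 MOD 11-2 scheme defined in GB 11643-1999."""
    weights = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
    check_chars = "10X98765432"
    total = sum(int(d) * w for d, w in zip(number[:17], weights))
    return check_chars[total % 11] == number[17].upper()


def rule_match(text):
    """Return (start, end, label, value) for every rule hit in text."""
    found = []
    for label, pattern in RULES.items():
        for m in pattern.finditer(text):
            if label == "ID_CARD" and not id_checksum_ok(m.group()):
                continue  # right shape, wrong check character
            found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)


def decode_bio(chars, tags):
    """Convert per-character BIO tags into (start, end, label, value) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label, "".join(chars[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def detect(text, tags):
    """Union of rule hits and labelled spans, ordered by position."""
    return sorted(rule_match(text) + decode_bio(list(text), tags))


if __name__ == "__main__":
    text = "联系人张三，手机13800138000"
    tags = ["O"] * len(text)
    tags[3], tags[4] = "B-NAME", "I-NAME"  # stand-in for model output
    for span in detect(text, tags):
        print(span)
```

In this sketch the two detectors never need to agree on a tokenisation: both report character offsets into the same string, so their results can simply be merged and sorted. A real system would also need a policy for overlapping spans, which is left out here.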