A novel framework for Chinese personal sensitive information detection
With the rapid development of social networks, the harm caused by the leakage of personal sensitive information is becoming increasingly serious. In order to detect and identify personal sensitive information, existing methods build matching rules to detect specific sensitive entities and use machin...
| Main Authors: | Chenglong Ren, Xiao Lan, Xingshu Chen, Yonggang Luo, Shuhua Ruan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Taylor & Francis Group, 2024-12-01 |
| Series: | Connection Science |
| Subjects: | Chinese; personal sensitive information; rule matching; sequence labeling; context analysis |
| Online Access: | https://www.tandfonline.com/doi/10.1080/09540091.2023.2298310 |
| _version_ | 1849701006722465792 |
|---|---|
| author | Chenglong Ren; Xiao Lan; Xingshu Chen; Yonggang Luo; Shuhua Ruan |
| author_facet | Chenglong Ren; Xiao Lan; Xingshu Chen; Yonggang Luo; Shuhua Ruan |
| author_sort | Chenglong Ren |
| collection | DOAJ |
| description | With the rapid development of social networks, the harm caused by the leakage of personal sensitive information is becoming increasingly serious. To detect and identify personal sensitive information, existing methods build matching rules to detect specific sensitive entities and use machine learning methods to classify sensitive text. These methods face challenges in analysing context and adapting to the characteristics of the Chinese language. This paper proposes CPSID, a method for detecting Chinese personal sensitive information. On the one hand, CPSID utilises rule matching to detect specific personal sensitive information that contains only letters and numbers. On the other hand, and more importantly, CPSID constructs a sequence labelling model named EBC (ELECTRA-BiLSTM-CRF) to detect more complex personal sensitive information that consists of Chinese characters. The EBC model uses the latest ELECTRA algorithm to implement word embedding, and uses BiLSTM and CRF models to extract personal sensitive information, which allows it to detect Chinese sensitive entities accurately by analysing context information. The model achieves an F1 score of 94.09% on Chinese datasets, outperforming other similar models. Additionally, experiments on real data show that CPSID achieves better detection results than either individual method (rule matching or sequence labelling) alone. |
| format | Article |
| id | doaj-art-52ee45cd3db94b119c38861ac6930f8b |
| institution | DOAJ |
| issn | 0954-0091; 1360-0494 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | Taylor & Francis Group |
| record_format | Article |
| series | Connection Science |
| spelling | doaj-art-52ee45cd3db94b119c38861ac6930f8b; 2025-08-20T03:18:05Z; eng; Taylor & Francis Group; Connection Science; 0954-0091; 1360-0494; 2024-12-01; 36; 1; 10.1080/09540091.2023.2298310; A novel framework for Chinese personal sensitive information detection; Chenglong Ren (0), Xiao Lan (1), Xingshu Chen (2), Yonggang Luo (3), Shuhua Ruan (4); School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China; Cyber Science Research Institute, Sichuan University, Chengdu, People’s Republic of China; School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China; Cyber Science Research Institute, Sichuan University, Chengdu, People’s Republic of China; School of Cyber Science and Engineering, Sichuan University, Chengdu, People’s Republic of China; With the rapid development of social networks, the harm caused by the leakage of personal sensitive information is becoming increasingly serious. To detect and identify personal sensitive information, existing methods build matching rules to detect specific sensitive entities and use machine learning methods to classify sensitive text. These methods face challenges in analysing context and adapting to the characteristics of the Chinese language. This paper proposes CPSID, a method for detecting Chinese personal sensitive information. On the one hand, CPSID utilises rule matching to detect specific personal sensitive information that contains only letters and numbers. On the other hand, and more importantly, CPSID constructs a sequence labelling model named EBC (ELECTRA-BiLSTM-CRF) to detect more complex personal sensitive information that consists of Chinese characters. The EBC model uses the latest ELECTRA algorithm to implement word embedding, and uses BiLSTM and CRF models to extract personal sensitive information, which allows it to detect Chinese sensitive entities accurately by analysing context information. The model achieves an F1 score of 94.09% on Chinese datasets, outperforming other similar models. Additionally, experiments on real data show that CPSID achieves better detection results than either individual method (rule matching or sequence labelling) alone.; https://www.tandfonline.com/doi/10.1080/09540091.2023.2298310; Chinese; personal sensitive information; rule matching; sequence labeling; context analysis |
| spellingShingle | Chenglong Ren; Xiao Lan; Xingshu Chen; Yonggang Luo; Shuhua Ruan; A novel framework for Chinese personal sensitive information detection; Connection Science; Chinese; personal sensitive information; rule matching; sequence labeling; context analysis |
| title | A novel framework for Chinese personal sensitive information detection |
| title_sort | novel framework for chinese personal sensitive information detection |
| topic | Chinese; personal sensitive information; rule matching; sequence labeling; context analysis |
| url | https://www.tandfonline.com/doi/10.1080/09540091.2023.2298310 |
| work_keys_str_mv | AT chenglongren anovelframeworkforchinesepersonalsensitiveinformationdetection AT xiaolan anovelframeworkforchinesepersonalsensitiveinformationdetection AT xingshuchen anovelframeworkforchinesepersonalsensitiveinformationdetection AT yonggangluo anovelframeworkforchinesepersonalsensitiveinformationdetection AT shuhuaruan anovelframeworkforchinesepersonalsensitiveinformationdetection AT chenglongren novelframeworkforchinesepersonalsensitiveinformationdetection AT xiaolan novelframeworkforchinesepersonalsensitiveinformationdetection AT xingshuchen novelframeworkforchinesepersonalsensitiveinformationdetection AT yonggangluo novelframeworkforchinesepersonalsensitiveinformationdetection AT shuhuaruan novelframeworkforchinesepersonalsensitiveinformationdetection |