An Application Programming Interface (API) Sensitive Data Identification Method Based on the Federated Large Language Model

The traditional methods for identifying sensitive data in APIs mainly encompass rule-based and machine learning-based approaches. However, these methods suffer from inadequacies in terms of security and robustness, exhibit high false positive rates, and struggle to cope with evolving threat landscap...

Full description

Saved in:
Bibliographic Details
Main Authors: Jianping Wu, Lifeng Chen, Siyuan Fang, Chunming Wu
Format: Article
Language:English
Published: MDPI AG 2024-11-01
Series:Applied Sciences
Subjects:
Online Access:https://www.mdpi.com/2076-3417/14/22/10162
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The traditional methods for identifying sensitive data in APIs mainly encompass rule-based and machine learning-based approaches. However, these methods suffer from inadequacies in terms of security and robustness, exhibit high false positive rates, and struggle to cope with evolving threat landscapes. This paper proposes a method for detecting sensitive data in APIs based on the Federated Large Language Model (FedAPILLM). This method applies the large language model Qwen2.5 and the LoRA instruction tuning technique within the framework of federated learning (FL) to the field of data security. Under the premise of protecting data privacy, a domain-specific corpus and knowledge base are constructed for pre-training and fine-tuning, resulting in a large language model specifically designed for identifying sensitive data in APIs. This paper conducts comparative experiments involving Llama3 8B, Llama3.1 8B, and Qwen2.5 14B. The results demonstrate that Qwen2.5 14B can achieve similar or better performance levels compared to the Llama3.1 8B model with fewer training iterations.
ISSN:2076-3417