Large Language Models for Synthetic Dataset Generation of Cybersecurity Indicators of Compromise
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Sensors |
| Subjects: | |
| Online Access: | https://www.mdpi.com/1424-8220/25/9/2825 |
| Summary: | In the field of Cyber Threat Intelligence (CTI), the scarcity of high-quality, labelled datasets that include Indicators of Compromise (IoCs) hinders the design and implementation of robust predictive models capable of classifying IoCs in online communication, specifically in social media contexts where users are potentially highly exposed to cyber threats. Generating high-quality synthetic datasets can fill this gap and support the development of effective CTI systems. This study therefore aims to fine-tune OpenAI’s Large Language Model (LLM), GPT-3.5, to generate a synthetic dataset that replicates the style of a real, curated social media dataset and incorporates selected IoCs as domain knowledge. Four machine-learning (ML) and deep-learning (DL) models were evaluated on two generated datasets (one with 4000 instances and the other with 12,000). On the 4000-instance dataset, the Dense Neural Network (DenseNN) achieved the highest accuracy (77%), while on the 12,000-instance dataset, Logistic Regression (LR) achieved the highest accuracy (82%). This study highlights the potential of integrating fine-tuned LLMs with domain-specific knowledge to create high-quality synthetic data. The main contribution of this research is the fine-tuning of an LLM, GPT-3.5, on real social media datasets and curated IoC domain knowledge, which is expected to improve synthetic dataset generation and subsequent IoC extraction and classification, offering a realistic and novel resource for cybersecurity applications. |
|---|---|
| ISSN: | 1424-8220 |