On the Effectiveness of Automatic Code Generation for Synthetic Dataset Creation
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11072675/ |
| Summary: | This paper compares synthetic and real-world code datasets for machine learning applications in cybersecurity by examining the relationships between machine code and Low-Level Virtual Machine Intermediate Representation (LLVM IR). The study compares 1000 randomly generated programs from a compiler fuzzer against 1000 randomly selected samples from AnghaBench to evaluate their suitability for security analysis tasks. Statistical analysis revealed that fuzzer-generated code consistently exhibits more complex instruction patterns and achieves broader coverage of the available instruction sets than real-world samples, with statistically significant differences across all measured categories ($p < 0.001$). The research examines instruction distributions, coverage metrics, program complexity, and statistical properties to characterize the differences between synthetic and real-world code. The findings have important implications for vulnerability detection and malware analysis systems, and they show that synthetic data generation can effectively complement or potentially surpass real-world samples. These insights help security researchers and practitioners select training datasets for machine learning applications in cybersecurity. |
| ISSN: | 2169-3536 |