Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach
The increasing prevalence of malicious Microsoft Office documents poses a significant threat to cybersecurity. Conventional methods of detecting these malicious documents often rely on prior knowledge of the document or the exploitation method employed, thus enabling the use of signature-based or ru...
| Main Authors: | Jonas Heß, Kalman Graffi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Journal of Cybersecurity and Privacy |
| Subjects: | cybersecurity; malware; malicious documents; AI |
| Online Access: | https://www.mdpi.com/2624-800X/5/2/32 |
| _version_ | 1849431392301088768 |
|---|---|
| author | Jonas Heß, Kalman Graffi |
| collection | DOAJ |
| description | The increasing prevalence of malicious Microsoft Office documents poses a significant threat to cybersecurity. Conventional methods of detecting these malicious documents often rely on prior knowledge of the document or the exploitation method employed, thus enabling the use of signature-based or rule-based approaches. Given the accelerated pace of change in the threat landscape, these methods are unable to adapt effectively to the evolving environment. Existing machine learning approaches are capable of identifying sophisticated features that enable the prediction of a file’s nature, achieving sufficient results on existing samples. However, they are seldom adequately prepared for the detection of new, advanced malware techniques. This paper proposes a novel approach to detecting malicious Microsoft Office documents by leveraging the power of large language models (LLMs). The method involves extracting textual content from Office documents and utilising advanced natural language processing techniques provided by LLMs to analyse the documents for potentially malicious indicators. As a supplementary tool to contemporary antivirus software, it is currently able to assist in the analysis of malicious Microsoft Office documents by identifying and summarising potentially malicious indicators with a foundation in evidence, which may prove to be more effective with advancing technology and soon to surpass tailored machine learning algorithms, even without the utilisation of signatures and detection rules. As such, it is not limited to Office Open XML documents, but can be applied to any maliciously exploitable file format. The extensive knowledge base and rapid analytical abilities of a large language model enable not only the assessment of extracted evidence but also the contextualisation and referencing of information to support the final decision. We demonstrate that Claude 3.5 Sonnet by Anthropic, provided with a substantial quantity of raw data, equivalent to several hundred pages, can identify individual malicious indicators within an average of five to nine seconds and generate a comprehensive static analysis report, with an average cost of USD 0.19 per request and an F1-score of 0.929. |
| format | Article |
| id | doaj-art-9cc193e5a18e46bbb14a547507bc7a3e |
| institution | Kabale University |
| issn | 2624-800X |
| language | English |
| publishDate | 2025-06-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Journal of Cybersecurity and Privacy |
| spelling | doaj-art-9cc193e5a18e46bbb14a547507bc7a3e; 2025-08-20T03:27:40Z; eng; MDPI AG; Journal of Cybersecurity and Privacy; ISSN 2624-800X; 2025-06-01; volume 5, issue 2, article 32; doi:10.3390/jcp5020032; Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach; Jonas Heß and Kalman Graffi (both: Faculty of Computer Science, Bingen Technical University of Applied Sciences, 55411 Bingen, Germany); https://www.mdpi.com/2624-800X/5/2/32; cybersecurity; malware; malicious documents; AI |
| title | Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach |
| topic | cybersecurity; malware; malicious documents; AI |
| url | https://www.mdpi.com/2624-800X/5/2/32 |
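
The description above mentions extracting textual content from Office documents before handing it to an LLM. The record does not say how the authors do this, so the following is only a minimal sketch under stated assumptions: it relies on the fact that an OOXML file is a ZIP container, treats parts ending in `.xml`, `.rels`, or `.vml` as text (an assumption), and only lists the names of other parts such as `vbaProject.bin`. The function names, the `max_chars` limit, and the prompt wording are hypothetical, and no model is actually called.

```python
import io
import zipfile

# Assumption: parts with these suffixes are text-like and worth showing to a model.
TEXT_SUFFIXES = (".xml", ".rels", ".vml")


def extract_ooxml_parts(data: bytes, max_chars: int = 20_000):
    """Split an OOXML container (given as bytes) into text parts and binary part names."""
    text_parts = {}
    binary_parts = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = info.filename
            if name.lower().endswith(TEXT_SUFFIXES):
                raw = archive.read(name)
                # Truncate each part so a single large part cannot dominate the prompt.
                text_parts[name] = raw.decode("utf-8", errors="replace")[:max_chars]
            else:
                # Binary parts (e.g. a VBA project) are only listed by name here.
                binary_parts.append(name)
    return text_parts, binary_parts


def build_analysis_prompt(text_parts, binary_parts):
    """Assemble a plain-text prompt from the extracted parts (hypothetical wording)."""
    lines = [
        "Review the following OOXML parts for potentially malicious indicators.",
        "",
        "Binary parts (not decoded): " + (", ".join(binary_parts) or "none"),
    ]
    for name, content in sorted(text_parts.items()):
        lines.append("")
        lines.append(f"--- {name} ---")
        lines.append(content)
    return "\n".join(lines)
```

To use it, read a document with `open(path, "rb").read()`, pass the bytes to `extract_ooxml_parts`, and send the string from `build_analysis_prompt` to whichever model client is in use. Listing binary parts by name rather than decoding them keeps the sketch static and avoids parsing untrusted binary formats.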