Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach

The increasing prevalence of malicious Microsoft Office documents poses a significant threat to cybersecurity. Conventional methods of detecting these malicious documents often rely on prior knowledge of the document or the exploitation method employed, thus enabling the use of signature-based or ru...

Bibliographic Details
Main Authors: Jonas Heß, Kalman Graffi
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Journal of Cybersecurity and Privacy
Subjects: cybersecurity, malware, malicious documents, AI
Online Access: https://www.mdpi.com/2624-800X/5/2/32
author Jonas Heß 
Kalman Graffi
collection DOAJ
description The increasing prevalence of malicious Microsoft Office documents poses a significant threat to cybersecurity. Conventional detection methods often rely on prior knowledge of the document or of the exploitation method employed, which makes signature-based or rule-based approaches possible. Given how quickly the threat landscape changes, these methods cannot adapt effectively. Existing machine learning approaches can identify sophisticated features that predict a file's nature and achieve sufficient results on existing samples, but they are seldom adequately prepared to detect new, advanced malware techniques. This paper proposes a novel approach to detecting malicious Microsoft Office documents with large language models (LLMs). The method extracts textual content from Office documents and uses the natural language processing capabilities of LLMs to analyse that content for potentially malicious indicators. As a supplementary tool to contemporary antivirus software, it can currently assist in the analysis of malicious Microsoft Office documents by identifying and summarising potentially malicious indicators, each grounded in evidence. With advancing technology it may become more effective and could soon surpass tailored machine learning algorithms, even without signatures or detection rules. The approach is therefore not limited to Office Open XML documents but can be applied to any maliciously exploitable file format. The extensive knowledge base and rapid analytical abilities of a large language model enable not only the assessment of extracted evidence but also the contextualisation and referencing of information to support the final decision. We demonstrate that Claude 3.5 Sonnet by Anthropic, provided with a substantial quantity of raw data equivalent to several hundred pages, can identify individual malicious indicators within an average of five to nine seconds and generate a comprehensive static analysis report, at an average cost of USD 0.19 per request and with an F1-score of 0.929.
format Article
id doaj-art-9cc193e5a18e46bbb14a547507bc7a3e
institution Kabale University
issn 2624-800X
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Journal of Cybersecurity and Privacy
spelling doaj-art-9cc193e5a18e46bbb14a547507bc7a3e
2025-08-20T03:27:40Z
eng
MDPI AG
Journal of Cybersecurity and Privacy
2624-800X
2025-06-01
Volume 5, Issue 2, Article 32
10.3390/jcp5020032
Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach
Jonas Heß (0), Kalman Graffi (1)
(0) Faculty of Computer Science, Bingen Technical University of Applied Sciences, 55411 Bingen, Germany
(1) Faculty of Computer Science, Bingen Technical University of Applied Sciences, 55411 Bingen, Germany
https://www.mdpi.com/2624-800X/5/2/32
cybersecurity; malware; malicious documents; AI
title Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach
topic cybersecurity
malware
malicious documents
AI
url https://www.mdpi.com/2624-800X/5/2/32
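
The description above says the method extracts textual content from Office documents before passing it to an LLM. The following is a minimal sketch of what such an extraction step could look like. It is not the authors' implementation. It relies only on the fact that an OOXML file is a ZIP package; the names `extract_text_parts`, `build_prompt`, `TEXT_SUFFIXES` and the `max_chars` limit are hypothetical choices made for illustration, and no LLM call is shown.

```python
import io
import zipfile

# Hypothetical choice: which package parts are treated as "textual".
TEXT_SUFFIXES = (".xml", ".rels", ".vml", ".txt")


def extract_text_parts(source, max_chars=20_000):
    """Return {part_name: text} for the parts of an OOXML (ZIP) package.

    `source` is a file path or a binary file-like object. Textual parts are
    decoded and truncated to `max_chars`; other parts (for example
    vbaProject.bin) are only listed with their size, not decoded.
    """
    parts = {}
    with zipfile.ZipFile(source) as package:
        for info in package.infolist():
            if info.is_dir():
                continue
            name = info.filename
            if name.lower().endswith(TEXT_SUFFIXES):
                raw = package.read(name)
                parts[name] = raw.decode("utf-8", errors="replace")[:max_chars]
            else:
                parts[name] = f"<binary part, {info.file_size} bytes>"
    return parts


def build_prompt(parts):
    """Join the extracted parts into one plain-text block, sorted by part name."""
    sections = [f"### {name}\n{text}" for name, text in sorted(parts.items())]
    return "\n\n".join(sections)
```

In this sketch, binary parts are reported by name and size only, so a reviewer or a model can still see that a macro container is present. A real pipeline would probably need a dedicated decoder for such parts and a limit on decompressed size.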