Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach

The increasing prevalence of malicious Microsoft Office documents poses a significant threat to cybersecurity. Conventional methods of detecting these malicious documents often rely on prior knowledge of the document or the exploitation method employed, thus enabling the use of signature-based or ru...


Bibliographic Details
Main Authors:	Jonas Heß, Kalman Graffi
Format:	Article
Language:	English
Published:	MDPI AG 2025-06-01
Series:	Journal of Cybersecurity and Privacy
Subjects:	cybersecurity, malware, malicious documents, AI
Online Access:	https://www.mdpi.com/2624-800X/5/2/32

_version_	1849431392301088768
author	Jonas Heß, Kalman Graffi
author_facet	Jonas Heß, Kalman Graffi
author_sort	Jonas Heß
collection	DOAJ
description	The increasing prevalence of malicious Microsoft Office documents poses a significant threat to cybersecurity. Conventional methods of detecting these malicious documents often rely on prior knowledge of the document or the exploitation method employed, thus enabling the use of signature-based or rule-based approaches. Given the accelerated pace of change in the threat landscape, these methods are unable to adapt effectively to the evolving environment. Existing machine learning approaches are capable of identifying sophisticated features that enable the prediction of a file’s nature, achieving sufficient results on existing samples. However, they are seldom adequately prepared for the detection of new, advanced malware techniques. This paper proposes a novel approach to detecting malicious Microsoft Office documents by leveraging the power of large language models (LLMs). The method involves extracting textual content from Office documents and utilising advanced natural language processing techniques provided by LLMs to analyse the documents for potentially malicious indicators. As a supplementary tool to contemporary antivirus software, it can currently assist in the analysis of malicious Microsoft Office documents by identifying and summarising potentially malicious indicators, each grounded in evidence. With advancing technology, it may prove more effective and could soon surpass tailored machine learning algorithms, even without the use of signatures and detection rules. The approach is not limited to Office Open XML documents, but can be applied to any maliciously exploitable file format. The extensive knowledge base and rapid analytical abilities of a large language model enable not only the assessment of extracted evidence but also the contextualisation and referencing of information to support the final decision. We demonstrate that Claude 3.5 Sonnet by Anthropic, provided with a substantial quantity of raw data, equivalent to several hundred pages, can identify individual malicious indicators within an average of five to nine seconds and generate a comprehensive static analysis report, with an average cost of USD 0.19 per request and an F1-score of 0.929.
format	Article
id	doaj-art-9cc193e5a18e46bbb14a547507bc7a3e
institution	Kabale University
issn	2624-800X
language	English
publishDate	2025-06-01
publisher	MDPI AG
record_format	Article
series	Journal of Cybersecurity and Privacy
spelling	doaj-art-9cc193e5a18e46bbb14a547507bc7a3e | 2025-08-20T03:27:40Z | eng | MDPI AG | Journal of Cybersecurity and Privacy | 2624-800X | 2025-06-01 | volume 5, issue 2, article 32 | 10.3390/jcp5020032 | Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach | Jonas Heß, Kalman Graffi | Faculty of Computer Science, Bingen Technical University of Applied Sciences, 55411 Bingen, Germany (both authors) | https://www.mdpi.com/2624-800X/5/2/32 | cybersecurity, malware, malicious documents, AI
title	Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach
topic	cybersecurity, malware, malicious documents, AI
url	https://www.mdpi.com/2624-800X/5/2/32

Similar Items