Effective Context-Aware File Path Embeddings for Anomaly Detection

In digital forensics, especially Windows forensics, identifying anomalous file paths is crucial when dealing with large-scale data. Traditional static embedding methods, which aggregate token-level representations, discard hierarchical and sequential relationships in file paths, leading to misclassi...

Bibliographic Details
Main Authors: Ra-Kyung Lee, Hyun-Min Song, Taek-Young Youn
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Systems
Subjects: digital forensics; file path analysis; sequence modeling; word embeddings; anomaly detection
Online Access: https://www.mdpi.com/2079-8954/13/6/403
collection DOAJ
description In digital forensics, especially Windows forensics, identifying anomalous file paths is crucial when dealing with large-scale data. Traditional static embedding methods, which aggregate token-level representations, discard hierarchical and sequential relationships in file paths, leading to misclassification of anomalies. This study introduces a Transformer-based sequence modeling approach to classify anomalous file paths, addressing these limitations by preserving positional and contextual relationships. File paths from the NTFS Master File Table (MFT) were embedded using FastText to capture structural and contextual dependencies. Unlike static embeddings, the proposed method processes file paths as structured sequences to enhance anomaly detection accuracy. Extensive experiments showed that Transformer models generally outperformed traditional methods in detecting structured anomalies. The Transformer model with FastText embeddings (32 dimensions) achieved an accuracy of 0.9781 and an F1-score of 0.9782, while Random Forest with FastText embeddings (64 dimensions) achieved an accuracy of 0.9729 and an F1-score of 0.9729. These findings suggest that a hybrid anomaly detection framework combining Transformer-based models with traditional techniques could enhance robustness in forensic investigations. Future research should explore combining both methods to improve adaptability across diverse forensic scenarios.
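The contrast the description draws between static aggregation and sequence modeling can be illustrated with a small sketch. This is not the authors' implementation: `toy_embed` is a hypothetical hash-based stand-in for a FastText lookup, the sinusoidal position signal is an assumed choice of positional encoding, and both example paths are made up. It only shows that averaging token vectors gives the same result for two paths with the same components in a different order, while a per-token sequence with position information does not.

```python
import hashlib
import math

DIM = 8  # toy size for illustration; the record reports 32- and 64-dimensional FastText vectors


def tokenize(path):
    """Split a Windows-style path into lowercase components."""
    return [part.lower() for part in path.replace("/", "\\").split("\\") if part]


def toy_embed(token, dim=DIM):
    """Deterministic stand-in for an embedding lookup (hypothetical, NOT real FastText)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def mean_pool(path):
    """Static aggregation: average the token vectors, which ignores token order."""
    vectors = [toy_embed(token) for token in tokenize(path)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def positional_sequence(path):
    """Keep one vector per token and add a sinusoidal position signal (assumed encoding)."""
    sequence = []
    for pos, token in enumerate(tokenize(path)):
        vector = toy_embed(token)
        encoded = []
        for i, value in enumerate(vector):
            # Even dimensions use sin, odd dimensions use cos, sharing one frequency per pair.
            angle = pos / 10000 ** ((i - i % 2) / DIM)
            encoded.append(value + (math.sin(angle) if i % 2 == 0 else math.cos(angle)))
        sequence.append(encoded)
    return sequence


# Two hypothetical paths with identical components in a different order.
normal = r"C:\Windows\System32\svchost.exe"
reordered = r"C:\System32\Windows\svchost.exe"

same_mean = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(mean_pool(normal), mean_pool(reordered))
)
same_sequence = positional_sequence(normal) == positional_sequence(reordered)

print("mean-pooled vectors match:", same_mean)
print("positional sequences match:", same_sequence)
```

In a pipeline like the one described, a sequence of this kind would be the input to a Transformer classifier, whereas the pooled vector is what a model such as Random Forest would typically consume; that mapping is an assumption based on the description, not a detail the record states.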
id doaj-art-56b7a87c90c0471c86df9d653e375bde
institution Kabale University
issn 2079-8954
doi 10.3390/systems13060403
volume 13
issue 6
article 403
affiliation Department of Cyber Security, Dankook University, Jukjeon-ro 152, Yongin-si 16890, Republic of Korea (all three authors)
topic digital forensics
file path analysis
sequence modeling
word embeddings
anomaly detection