Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning

In forensic topical modelling, the α parameter controls the distribution of topics in documents. However, low, high, or incorrect values of α lead to topic sparsity, model overfitting, and suboptimal topic distribution. To control the word distribution across topics, the β parameter is introduced. H...

Bibliographic Details
Main Authors: D. Paul Joseph, Viswanathan Perumal
Format: Article
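Language: English
Published: PeerJ Inc. 2025-03-01
Series: PeerJ Computer Science
Subjects: Digital forensics; Disc forensics; Metadata; Blacklisted keywords; Forensic data classification; Forensic seed words
Online Access: https://peerj.com/articles/cs-2608.pdf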
author D. Paul Joseph
Viswanathan Perumal
collection DOAJ
description In forensic topic modelling, the α parameter controls the distribution of topics in documents. However, low, high, or incorrect values of α lead to topic sparsity, model overfitting, and suboptimal topic distribution. To control the word distribution across topics, the β parameter is introduced. However, low, high, or inappropriate β values lead to sparse distribution, disjointed topics, and abundant highly probable words. The βj parameter, in conjunction with seed-guided words based on Term Frequency and Inverse Document Frequency, is introduced to address these issues. Nevertheless, the data often suffers from skewness or noise due to frequent co-occurrences of unrelated polysemic word pairs generated using Pointwise Mutual Information. When α, β, and βj are integrated into file classification systems, the classification models converge to local optima with O(n log n · |V|) time complexity. To combat these challenges, this research proposes the SDOT Forensic Classification System (SFCS) with a functional parameter βk that identifies seed words by evaluating the semantic and contextual similarity of word vectors. As a result, the topic distribution (Θd) is compelled to model the curated seed words within the distribution, generating pertinent topics. Incorporating βk into SFCS allowed the proposed model to remove 278k irrelevant files from the corpus and identify 5.6k suspicious files by extracting 700 blacklisted keywords. Furthermore, this research implemented hyperparameter optimization and hyperplane maximization, resulting in a file classification accuracy of 94.6%, a precision of 94.4%, and a recall of 96.8% within O(n log n) complexity.
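The abstract describes βk as selecting seed words by the similarity of word vectors. The record gives no implementation details, so the following is only a minimal sketch of similarity-based seed filtering in general, not the authors' method. The function names, the toy three-dimensional vectors, the reference vector, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_seed_words(candidates, reference, threshold=0.8):
    """Return, in sorted order, the candidate words whose vector has
    cosine similarity >= threshold with the reference vector."""
    return sorted(
        word for word, vec in candidates.items()
        if cosine(vec, reference) >= threshold
    )


# Hypothetical toy embeddings; a real system would use trained word vectors.
toy_vectors = {
    "ransom": [0.9, 0.1, 0.0],
    "payload": [0.8, 0.3, 0.1],
    "holiday": [0.0, 0.2, 0.9],
}

# Hypothetical reference vector standing in for a known blacklisted keyword.
reference_vector = [1.0, 0.2, 0.0]

seeds = select_seed_words(toy_vectors, reference_vector)
print(seeds)
```

With these toy values, "ransom" and "payload" lie close to the reference vector and are kept, while "holiday" falls well below the threshold and is dropped.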
format Article
id doaj-art-e78e97acd6714115981f349b22c03c4a
institution DOAJ
issn 2376-5992
language English
publishDate 2025-03-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
spelling doaj-art-e78e97acd6714115981f349b22c03c4a
2025-08-20T02:58:10Z
eng
PeerJ Inc.
PeerJ Computer Science
2376-5992
2025-03-01
11
e2608
10.7717/peerj-cs.2608
Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning
D. Paul Joseph (0)
Viswanathan Perumal (1)
(0) School of Computer Science Engineering and Information Systems, Vellore Institute of Technology University, Vellore, Tamilnadu, India
(1) Department of IoT, School of Computer Science and Engineering, Vellore Institute of Technology University, Vellore, Tamilnadu, India
title Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning
topic Digital forensics
Disc forensics
Metadata
Blacklisted keywords
Forensic data classification
Forensic seed words
url https://peerj.com/articles/cs-2608.pdf