Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning
In forensic topical modelling, the α parameter controls the distribution of topics in documents. However, low, high, or incorrect values of α lead to topic sparsity, model overfitting, and suboptimal topic distribution. To control the word distribution across topics, the β parameter is introduced. H...
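As a rough illustration of why the value of α matters, the sketch below assumes an LDA-style model in which α is the concentration of a symmetric Dirichlet prior over each document's topic mixture. The record does not state this explicitly, so treat it as an assumption. The helper names (`sample_dirichlet`, `mean_max_share`) and the values 0.1 and 50.0 are hypothetical, chosen only to contrast a small and a large α.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for tiny alpha
        return [1.0 / k] * k
    return [d / total for d in draws]


def mean_max_share(alpha, k=10, trials=2000, seed=0):
    """Average weight of the single largest topic across many sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(trials)) / trials


# Assumed illustrative values: a small alpha concentrates each document on few topics,
# a large alpha spreads weight almost evenly across all topics.
sparse = mean_max_share(0.1)
smooth = mean_max_share(50.0)
print(f"alpha=0.1  -> mean largest topic share {sparse:.2f}")
print(f"alpha=50.0 -> mean largest topic share {smooth:.2f}")
```

Under this assumption, a very small α pushes each document towards one dominant topic (topic sparsity), while a very large α flattens the mixture so that topics become hard to tell apart.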
| Main Authors: | D. Paul Joseph, Viswanathan Perumal |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | PeerJ Inc. 2025-03-01 |
| Series: | PeerJ Computer Science |
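| Subjects: | Digital forensics; Disc forensics; Metadata; Blacklisted keywords; Forensic data classification; Forensic seed words |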
| Online Access: | https://peerj.com/articles/cs-2608.pdf |
| _version_ | 1850033585244864512 |
|---|---|
| author | D. Paul Joseph; Viswanathan Perumal |
| author_facet | D. Paul Joseph; Viswanathan Perumal |
| author_sort | D. Paul Joseph |
| collection | DOAJ |
| description | In forensic topical modelling, the α parameter controls the distribution of topics in documents. However, low, high, or incorrect values of α lead to topic sparsity, model overfitting, and suboptimal topic distribution. To control the word distribution across topics, the β parameter is introduced. However, low, high, or inappropriate β values lead to sparse distribution, disjointed topics, and abundant highly probable words. The βj parameter, in conjunction with seed-guided words based on Term Frequency and Inverse Document Frequency, is introduced to address these issues. Nevertheless, the data often suffers from skewness or noise due to frequent co-occurrences of unrelated polysemic word pairs generated using Pointwise Mutual Information. When α, β, and βj are integrated into file classification systems, classification models converge to local optima with O(n log n · \|V\|) time complexity. To combat these challenges, this research proposes the SDOT Forensic Classification System (SFCS) with a functional parameter βk that identifies seed words by evaluating the semantic and contextual similarity of word vectors. As a result, the topic distribution (Θd) is compelled to model the curated seed words within the distribution, generating pertinent topics. Incorporating βk into SFCS allowed the proposed model to remove 278k irrelevant files from the corpus and identify 5.6k suspicious files by extracting 700 blacklisted keywords. Furthermore, this research implemented hyperparameter optimization and hyperplane maximization, resulting in a file classification accuracy of 94.6%, precision of 94.4%, and recall of 96.8% within O(n log n) complexity. |
| format | Article |
| id | doaj-art-e78e97acd6714115981f349b22c03c4a |
| institution | DOAJ |
| issn | 2376-5992 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | PeerJ Inc. |
| record_format | Article |
| series | PeerJ Computer Science |
| spelling | doaj-art-e78e97acd6714115981f349b22c03c4a; 2025-08-20T02:58:10Z; eng; PeerJ Inc.; PeerJ Computer Science; 2376-5992; 2025-03-01; 11; e2608; 10.7717/peerj-cs.2608; Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning; D. Paul Joseph (School of Computer Science Engineering and Information Systems, Vellore Institute of Technology University, Vellore, Tamilnadu, India); Viswanathan Perumal (Department of IoT, School of Computer Science and Engineering, Vellore Institute of Technology University, Vellore, Tamilnadu, India); https://peerj.com/articles/cs-2608.pdf; Digital forensics; Disc forensics; Metadata; Blacklisted keywords; Forensic data classification; Forensic seed words |
| spellingShingle | D. Paul Joseph; Viswanathan Perumal; Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning; PeerJ Computer Science; Digital forensics; Disc forensics; Metadata; Blacklisted keywords; Forensic data classification; Forensic seed words |
| title | Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning |
| title_full | Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning |
| title_fullStr | Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning |
| title_full_unstemmed | Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning |
| title_short | Optimizing forensic file classification: enhancing SFCS with βk hyperparameter tuning |
| title_sort | optimizing forensic file classification enhancing sfcs with βk hyperparameter tuning |
| topic | Digital forensics; Disc forensics; Metadata; Blacklisted keywords; Forensic data classification; Forensic seed words |
| url | https://peerj.com/articles/cs-2608.pdf |
| work_keys_str_mv | AT dpauljoseph optimizingforensicfileclassificationenhancingsfcswithbkhyperparametertuning AT viswanathanperumal optimizingforensicfileclassificationenhancingsfcswithbkhyperparametertuning |
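
The description above says that βk identifies seed words by evaluating the semantic and contextual similarity of word vectors. The record gives no further detail, so the following is only a minimal sketch of one common way to do this: ranking candidate words by cosine similarity to an anchor term and keeping those above a cut-off. It is not the authors' SFCS implementation. The function names, the 3-dimensional toy vectors, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_seed_words(anchor, vectors, threshold=0.8):
    """Return (word, similarity) pairs at or above the threshold, most similar first."""
    anchor_vec = vectors[anchor]
    scored = [(w, cosine(anchor_vec, v)) for w, v in vectors.items() if w != anchor]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real system would load trained word vectors.
toy_vectors = {
    "malware": [0.9, 0.1, 0.0],
    "trojan": [0.8, 0.2, 0.1],
    "ransomware": [0.85, 0.05, 0.1],
    "invoice": [0.1, 0.9, 0.2],
    "holiday": [0.0, 0.2, 0.9],
}

seeds = select_seed_words("malware", toy_vectors)
for word, score in seeds:
    print(f"{word}: {score:.3f}")
```

With these toy vectors, only "ransomware" and "trojan" pass the threshold for the anchor "malware", while unrelated words such as "invoice" and "holiday" are filtered out. A similarity cut-off of this kind is one way to avoid the unrelated word pairs that raw co-occurrence statistics can produce.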