An active learning driven deep spatio-textural acoustic feature ensemble assisted learning environment for violence detection in surveillance videos

Bibliographic Details
Main Authors: Duba Sriveni, Dr. Loganathan R
Format: Article
Language: English
Published: Elsevier 2025-06-01
Series: Engineering Science and Technology, an International Journal
Subjects:
Online Access: http://www.sciencedirect.com/science/article/pii/S2215098625001053
Description
Summary: In this paper, a novel and robust deep spatio-textural acoustic feature ensemble-assisted learning environment (DestaVNet) is proposed for violence detection in surveillance videos. As the name indicates, the proposed DestaVNet model exploits both visual and acoustic features to perform violence detection. To ensure scalability, it employs an active learning concept that retains an optimally sufficient set of frames for further computation and thus decisively reduces computational cost. More specifically, the DestaVNet model first splits the input surveillance footage into acoustic signals and video frames, followed by multi-constraint active learning for selection of the most representative frames. It applies the least confidence (LC), entropy margin (EM), and margin sampling (MS) criteria to retain the optimal frames for further feature extraction. The DestaVNet model then performs pre-processing and feature extraction separately over the frames and the corresponding acoustic signals. The retained frames undergo intensity equalization, histogram equalization, resizing, and z-score normalization as pre-processing, followed by deep spatio-textural feature extraction using the gray-level co-occurrence matrix (GLCM) and the ResNet101 and SqueezeNet deep networks. From the acoustic signals, features including mel-frequency cepstral coefficients (MFCC), gammatone cepstral coefficients (GTCC), GTCC-Δ, the harmonic-to-noise ratio (HNR), spectral features, and pitch are obtained. The acoustic and spatio-textural features are fused to yield a composite audio-visual feature set, which is processed with principal component analysis (PCA) to minimize redundancy; z-score normalization is applied to the fused features to alleviate over-fitting. Finally, the retained feature set is passed for two-class classification to a heterogeneous ensemble learning model comprising SVM, DT, k-NN, NB, and RF classifiers, with k-NN within the ensemble enhancing prediction accuracy. Simulation results confirm that the proposed DestaVNet model outperforms existing violence prediction methods, with its superiority affirmed in terms of accuracy (99.92%), precision (99.67%), recall (99.29%), and F-Measure (0.992).
ISSN: 2215-0986
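
The sketch below illustrates the two mechanisms named in the abstract: uncertainty-driven frame selection (LC, entropy, margin criteria) and a PCA-plus-heterogeneous-ensemble classifier. It is not the authors' implementation; the thresholds, the PCA variance target, and the individual classifier settings are assumptions chosen only to make the example self-contained and runnable.

```python
# Minimal sketch (assumed settings, not the paper's code) of uncertainty-based
# frame selection and a PCA + heterogeneous voting ensemble, using scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def uncertainty_scores(probs: np.ndarray):
    """Per-frame uncertainty from class probabilities of shape (n_frames, n_classes)."""
    least_confidence = 1.0 - probs.max(axis=1)               # LC: low top probability
    sorted_p = np.sort(probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]                # MS: small top-2 margin
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)    # entropy criterion
    return least_confidence, margin, entropy

def select_frames(probs, lc_thr=0.3, margin_thr=0.2, ent_thr=0.5):
    """Keep only the most informative frames; all thresholds here are illustrative."""
    lc, ms, ent = uncertainty_scores(probs)
    keep = (lc > lc_thr) | (ms < margin_thr) | (ent > ent_thr)
    return np.flatnonzero(keep)

# Heterogeneous ensemble over the fused audio-visual features after z-score + PCA.
ensemble = make_pipeline(
    StandardScaler(),                      # z-score normalization of fused features
    PCA(n_components=0.95),                # retain 95% variance (assumed setting)
    VotingClassifier(
        estimators=[
            ("svm", SVC(probability=True)),
            ("dt", DecisionTreeClassifier()),
            ("knn", KNeighborsClassifier()),
            ("nb", GaussianNB()),
            ("rf", RandomForestClassifier()),
        ],
        voting="soft",
    ),
)
# Usage: ensemble.fit(X_train, y_train); y_pred = ensemble.predict(X_test)
```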