An active learning driven deep spatio-textural acoustic feature ensemble assisted learning environment for violence detection in surveillance videos

Bibliographic Details
Main Authors: Duba Sriveni, Dr.Loganathan R
Format: Article
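Language: English
Published: Elsevier 2025-06-01
Series: Engineering Science and Technology, an International Journal
Subjects: Violence prediction; Video Analytics; Deep Spatio-Textural Acoustic Features; Audio-Visual Features; Deep Learning; Signal Processing
Online Access: http://www.sciencedirect.com/science/article/pii/S2215098625001053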
author Duba Sriveni
Dr.Loganathan R
collection DOAJ
description This paper proposes DestaVNet, a deep spatio-textural acoustic feature ensemble-assisted learning environment for violence detection in surveillance videos. As the name indicates, DestaVNet exploits both visual and acoustic features to detect violence. To keep the solution scalable, it uses active learning to retain only a sufficient subset of frames for further computation, which reduces computational cost. Specifically, DestaVNet first splits the input surveillance footage into acoustic signals and video frames, then applies multi-constraint active learning to select the most representative frames, using the least confidence (LC), entropy margin (EM), and margin sampling (MS) criteria. Pre-processing and feature extraction are then executed separately over the retained frames and the corresponding acoustic signals. Pre-processing comprises intensity equalization, histogram equalization, resizing, and z-score normalization; the z-score normalization is performed to alleviate over-fitting. Deep spatio-textural features are extracted using the gray level co-occurrence matrix (GLCM) together with the ResNet101 and SqueezeNet deep networks. The acoustic features include mel-frequency cepstral coefficients (MFCC), gammatone cepstral coefficients (GTCC), GTCC-Δ, harmonic-to-noise ratio (HNR), spectral features, and pitch. The acoustic and spatio-textural features are fused into a composite audio-visual feature set, which is reduced with principal component analysis (PCA) to minimize redundancy. Finally, the retained features are passed to a two-class heterogeneous ensemble classifier embodying SVM, DT, k-NN, NB, and RF classifiers, which is intended to enhance prediction accuracy. Simulation results indicate that DestaVNet outperforms existing violence prediction methods, with an accuracy of 99.92%, precision of 99.67%, recall of 99.29%, and an F-measure of 0.992.
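The frame-selection step named in the description can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each frame already has class probabilities from some preliminary classifier, that "entropy margin" can be approximated by plain Shannon entropy, and that the three criteria are combined by a simple average. The function names, the combination rule, and the `budget` parameter are all hypothetical.

```python
import math


def least_confidence(probs):
    """Least confidence: 1 minus the top class probability (higher = more uncertain)."""
    return 1.0 - max(probs)


def margin(probs):
    """Margin sampling: gap between the two most probable classes (smaller = more uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def entropy(probs):
    """Shannon entropy in bits (higher = more uncertain)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def uncertainty(probs):
    """Hypothetical combined score: average of the three criteria, each scaled to [0, 1]."""
    k = len(probs)  # number of classes, assumed >= 2
    lc = least_confidence(probs) / (1.0 - 1.0 / k)
    ms = 1.0 - margin(probs)
    en = entropy(probs) / math.log2(k)
    return (lc + ms + en) / 3.0


def select_frames(frame_probs, budget):
    """Return the indices (in temporal order) of the `budget` most uncertain frames."""
    ranked = sorted(range(len(frame_probs)),
                    key=lambda i: uncertainty(frame_probs[i]),
                    reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Made-up [non-violent, violent] probabilities for four frames.
    frame_probs = [[0.99, 0.01], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]
    print(select_frames(frame_probs, budget=2))
```

Averaging scaled scores is only one way to combine the criteria. A pipeline could instead keep a frame only when every criterion exceeds its own threshold, and the description does not say which rule DestaVNet uses.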
format Article
id doaj-art-c1c412bc244a4c1fb3e4d2af3a5257d7
institution OA Journals
issn 2215-0986
language English
publishDate 2025-06-01
publisher Elsevier
record_format Article
series Engineering Science and Technology, an International Journal
spelling doaj-art-c1c412bc244a4c1fb3e4d2af3a5257d7; 2025-08-20T02:15:58Z; eng; Elsevier; Engineering Science and Technology, an International Journal; 2215-0986; 2025-06-01; 66; 102050; 10.1016/j.jestch.2025.102050
Duba Sriveni: Research Scholar in Computer Science and Engineering, HKBK College of Engineering - Research Center, Visvesvaraya Technological University, Karnataka, India; Assistant Professor, Department of Computer Science and Engineering, CVR College of Engineering, Hyderabad-501510, India; Corresponding author at: Research Scholar, Visvesvaraya Technological University, Belagavi, India; Assistant Professor, Department of CSE, CVR College of Engineering, Hyderabad, India.
Dr.Loganathan R: Professor & HOD, Department of Computer Science and Engineering, Vijaya Vittala Institute of Science and Technology, Bengaluru-560077, India
title An active learning driven deep spatio-textural acoustic feature ensemble assisted learning environment for violence detection in surveillance videos
topic Violence prediction
Video Analytics
Deep Spatio-Textural Acoustic Features
Audio-Visual Features
Deep Learning
Signal Processing
url http://www.sciencedirect.com/science/article/pii/S2215098625001053