An active learning driven deep spatio-textural acoustic feature ensemble assisted learning environment for violence detection in surveillance videos

In this paper, a novel and robust deep spatio-textural acoustic feature ensemble-assisted learning environment is proposed for violence detection in surveillance videos (DestaVNet). As the name indicates, the proposed DestaVNet model exploits visual and acoustic features to perform violence detectio...


Bibliographic Details
Main Authors:	Duba Sriveni, Dr.Loganathan R
Format:	Article
Language:	English
Published:	Elsevier 2025-06-01
Series:	Engineering Science and Technology, an International Journal
Subjects:	Violence prediction Video Analytics Deep Spatio-Textural Acoustic Features Audio-Visual Features Deep Learning Signal Processing
Online Access:	http://www.sciencedirect.com/science/article/pii/S2215098625001053

_version_	1850188107959238656
author	Duba Sriveni Dr.Loganathan R
author_facet	Duba Sriveni Dr.Loganathan R
author_sort	Duba Sriveni
collection	DOAJ
description	This paper proposes DestaVNet, a novel and robust deep spatio-textural acoustic feature ensemble-assisted learning environment for violence detection in surveillance videos. As the name indicates, DestaVNet exploits both visual and acoustic features to detect violence. To keep the solution scalable, it employs active learning to retain only a sufficient subset of frames for further computation, which reduces computational cost. Specifically, the model first splits the input surveillance footage into acoustic signals and video frames, and then applies multi-constraint active learning to select the most representative frames, using the least confidence (LC), entropy margin (EM), and margin sampling (MS) criteria. Pre-processing and feature extraction are then executed separately over the retained frames and the corresponding acoustic signals. For the frames, pre-processing comprises intensity equalization, histogram equalization, resizing and z-score normalization, followed by deep spatio-textural feature extraction using the gray level co-occurrence matrix (GLCM) together with the ResNet101 and SqueezeNet deep networks. For the audio, the extracted features include mel-frequency cepstral coefficients (MFCC), gammatone cepstral coefficients (GTCC), GTCC-Δ, harmonic-to-noise ratio (HNR), spectral features and pitch. The acoustic and spatio-textural features are fused into a composite audio-visual feature set, which is reduced with principal component analysis (PCA) to minimize redundancy; k-NN is used as part of the ensemble classifier to improve prediction accuracy. Z-score normalization is applied to alleviate over-fitting. Finally, the retained feature set is passed to a heterogeneous ensemble learning model comprising SVM, DT, k-NN, NB, and RF classifiers for two-class classification. Simulation results indicate that DestaVNet outperforms existing violence prediction methods, with reported accuracy (99.92%), precision (99.67%), recall (99.29%) and F-Measure (0.992).
format	Article
id	doaj-art-c1c412bc244a4c1fb3e4d2af3a5257d7
institution	OA Journals
issn	2215-0986
language	English
publishDate	2025-06-01
publisher	Elsevier
record_format	Article
series	Engineering Science and Technology, an International Journal
spelling	doaj-art-c1c412bc244a4c1fb3e4d2af3a5257d7; 2025-08-20T02:15:58Z; eng; Elsevier; Engineering Science and Technology, an International Journal; 2215-0986; 2025-06-01; 66; 102050; 10.1016/j.jestch.2025.102050; An active learning driven deep spatio-textural acoustic feature ensemble assisted learning environment for violence detection in surveillance videos; Duba Sriveni (0), Dr.Loganathan R (1); (0) Research Scholar in Computer Science and Engineering, HKBK College of Engineering - Research Center, Visvesvaraya Technological University, Karnataka, India; Assistant Professor, Department of Computer Science and Engineering, CVR College of Engineering, Hyderabad-501510, India; Corresponding author at: Research Scholar, Visvesvaraya Technological University, Belagavi, India; Assistant Professor, Department of CSE, CVR College of Engineering, Hyderabad, India; (1) Professor & HOD, Department of Computer Science and Engineering, Vijaya Vittala Institute of Science and Technology, Bengaluru-560077, India; http://www.sciencedirect.com/science/article/pii/S2215098625001053; Violence prediction; Video Analytics; Deep Spatio-Textural Acoustic Features; Audio-Visual Features; Deep Learning; Signal Processing
title	An active learning driven deep spatio-textural acoustic feature ensemble assisted learning environment for violence detection in surveillance videos
topic	Violence prediction Video Analytics Deep Spatio-Textural Acoustic Features Audio-Visual Features Deep Learning Signal Processing
url	http://www.sciencedirect.com/science/article/pii/S2215098625001053
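The abstract names three uncertainty criteria for active-learning frame selection (least confidence, an entropy-based criterion, and margin sampling) but does not say how they are computed or combined. The following is a minimal sketch of the textbook definitions, not the authors' implementation. It assumes each frame already has class probabilities from some preliminary classifier, reads "entropy margin" as plain Shannon entropy, and combines the criteria with a simple sum of ranks. The function names and the combination rule are hypothetical.

```python
import math


def least_confidence(probs):
    """1 minus the top class probability; larger means more uncertain."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two largest probabilities; smaller means more uncertain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def entropy(probs):
    """Shannon entropy in nats; larger means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_frames(frame_probs, k):
    """Return the indices of the k most uncertain frames.

    Assumption: the three criteria are merged by summing each frame's
    rank under every criterion (rank 0 = most uncertain).
    """
    n = len(frame_probs)
    criteria = (
        [least_confidence(p) for p in frame_probs],
        [-margin(p) for p in frame_probs],  # negated so larger = more uncertain
        [entropy(p) for p in frame_probs],
    )
    rank_sum = [0] * n
    for scores in criteria:
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order):
            rank_sum[idx] += rank
    chosen = sorted(range(n), key=lambda i: rank_sum[i])[:k]
    return sorted(chosen)


# Hypothetical per-frame probabilities for (non-violent, violent).
frame_probs = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
selected = select_frames(frame_probs, k=2)
```

For a two-class problem such as violent versus non-violent, all three criteria rank frames identically, so the combination rule only matters when more classes are involved.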

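The abstract lists the gray level co-occurrence matrix (GLCM) as the source of the textural part of the visual features. As an illustration only, the sketch below builds a symmetric, normalized GLCM for one pixel offset and derives three common texture statistics from it. The choice of offset, the number of gray levels, and the particular statistics are assumptions; the record does not state which settings the authors used.

```python
def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized co-occurrence matrix for the offset (dx, dy).

    `image` is a list of rows holding integer gray levels in [0, levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count both directions to make it symmetric
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Contrast, angular second moment (ASM) and homogeneity of a GLCM."""
    n = len(p)
    cells = [(i, j) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(p[i][j] * (i - j) ** 2 for i, j in cells),
        "asm": sum(p[i][j] ** 2 for i, j in cells),
        "homogeneity": sum(p[i][j] / (1 + (i - j) ** 2) for i, j in cells),
    }


# Small 4-level test image with horizontal neighbours (dx=1, dy=0).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
features = texture_features(glcm(image, levels=4))
```

In a pipeline like the one described, such statistics would be computed per retained frame and concatenated with the deep-network features before fusion with the acoustic features.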

Similar Items