An active learning driven deep spatio-textural acoustic feature ensemble assisted learning environment for violence detection in surveillance videos
In this paper, a novel and robust deep spatio-textural acoustic feature ensemble-assisted learning environment is proposed for violence detection in surveillance videos (DestaVNet). As the name indicates, the proposed DestaVNet model exploits visual and acoustic features to perform violence detectio...
| Main Authors: | Duba Sriveni, Dr. Loganathan R |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-06-01 |
| Series: | Engineering Science and Technology, an International Journal |
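| Subjects: | Violence prediction; Video Analytics; Deep Spatio-Textural Acoustic Features; Audio-Visual Features; Deep Learning; Signal Processing |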
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2215098625001053 |
| _version_ | 1850188107959238656 |
|---|---|
| author | Duba Sriveni; Dr. Loganathan R |
| collection | DOAJ |
| description | In this paper, a novel and robust deep spatio-textural acoustic feature ensemble-assisted learning environment (DestaVNet) is proposed for violence detection in surveillance videos. As the name indicates, DestaVNet exploits both visual and acoustic features to perform violence detection. To keep the solution scalable, it employs active learning to retain only a sufficient subset of frames for further computation, which reduces computational cost. More specifically, the model first splits the input surveillance footage into acoustic signals and video frames, then selects the most representative frames through multi-constraint active learning, applying the least confidence (LC), entropy margin (EM), and margin sampling (MS) criteria to retain the optimal frames for feature extraction. Pre-processing and feature extraction are executed separately over the frames and the corresponding acoustic signals. Pre-processing comprises intensity equalization, histogram equalization, resizing and z-score normalization, and is followed by deep spatio-textural feature extraction using the gray level co-occurrence matrix (GLCM) and the ResNet101 and SqueezeNet deep networks. On the acoustic side, the extracted features include mel-frequency cepstral coefficients (MFCC), gammatone cepstral coefficients (GTCC), GTCC-Δ, harmonic-to-noise ratio (HNR), spectral features and pitch. The acoustic and spatio-textural features were fused into a composite audio-visual feature set, which was then reduced with principal component analysis (PCA) to minimize redundancy; k-NN is used as part of the ensemble classifier to enhance prediction accuracy. The z-score normalization was performed to alleviate over-fitting. Finally, the retained features were processed for two-class classification by a heterogeneous ensemble learning model comprising SVM, DT, k-NN, NB, and RF classifiers. Simulation results confirmed that DestaVNet outperforms other existing violence prediction methods in terms of accuracy (99.92%), precision (99.67%), recall (99.29%) and F-measure (0.992). |
| format | Article |
| id | doaj-art-c1c412bc244a4c1fb3e4d2af3a5257d7 |
| institution | OA Journals |
| issn | 2215-0986 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Engineering Science and Technology, an International Journal |
| spelling | doaj-art-c1c412bc244a4c1fb3e4d2af3a5257d7; 2025-08-20T02:15:58Z; eng; Elsevier; Engineering Science and Technology, an International Journal; 2215-0986; 2025-06-01; 66; 102050; 10.1016/j.jestch.2025.102050; An active learning driven deep spatio-textural acoustic feature ensemble assisted learning environment for violence detection in surveillance videos; Duba Sriveni (Research Scholar in Computer Science and Engineering, HKBK College of Engineering - Research Center, Visveswaraya Technological university, Karnataka, India; Assistant Professor, Department of Computer Science and Engineering, CVR College of Engineering, Hyderabad-501510, India; Corresponding author at: Research Scholar, Visvesvaraya Technological University, Belgavi, India; Assistant Professor, Department of CSE, CVR College of Engineering, Hyderabad, India); Dr. Loganathan R (Professor & HOD, Department of Computer Science and Engineering, Vijaya Vittala Institute of Science and Technology, Bengaluru-560077, India); http://www.sciencedirect.com/science/article/pii/S2215098625001053; Violence prediction; Video Analytics; Deep Spatio-Textural Acoustic Features; Audio-Visual Features; Deep Learning; Signal Processing |
| title | An active learning driven deep spatio-textural acoustic feature ensemble assisted learning environment for violence detection in surveillance videos |
| topic | Violence prediction; Video Analytics; Deep Spatio-Textural Acoustic Features; Audio-Visual Features; Deep Learning; Signal Processing |
| url | http://www.sciencedirect.com/science/article/pii/S2215098625001053 |
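
The description above names three active-learning criteria for frame selection: least confidence (LC), margin sampling (MS), and an entropy-based criterion (EM). The record does not say how DestaVNet computes or combines them, so the following is only a minimal sketch under stated assumptions: each frame already has class probabilities from some preliminary classifier, the entropy criterion is plain Shannon entropy, the three scores are averaged with equal weight, and the most uncertain frames are the ones retained. The function names, the averaging rule, and the sample probabilities are hypothetical and are not taken from the paper.

```python
import math


def least_confidence(probs):
    """LC score: 1 minus the top class probability (higher = more uncertain)."""
    return 1.0 - max(probs)


def margin(probs):
    """MS score: gap between the two most likely classes (lower = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def entropy(probs):
    """Shannon entropy in nats (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_frames(frame_probs, budget):
    """Return the indices of the `budget` most uncertain frames, in frame order.

    Assumption: the three criteria are rescaled to [0, 1] and averaged with
    equal weight. This combination rule is illustrative only.
    """
    scores = []
    for index, probs in enumerate(frame_probs):
        max_entropy = math.log(len(probs))  # entropy of a uniform distribution
        combined = (
            least_confidence(probs)
            + (1.0 - margin(probs))          # flip so higher means more uncertain
            + entropy(probs) / max_entropy   # normalized entropy
        ) / 3.0
        scores.append((combined, index))
    # Sort by descending uncertainty and keep the first `budget` frames.
    chosen = sorted(scores, reverse=True)[:budget]
    return sorted(index for _, index in chosen)


# Hypothetical per-frame probabilities for (non-violent, violent).
frame_probs = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
selected = select_frames(frame_probs, budget=2)
print(selected)
```

For a two-class problem such as this one, all three scores are monotonic in the top class probability, so they rank frames identically; the criteria only diverge when there are three or more classes.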