Design of an Improved Model for Anomaly Detection in CCTV Systems Using Multimodal Fusion and Attention-Based Networks
Traditional video-analysis approaches often mischaracterize anomalies: they typically rely on single-modality input and handle complex temporal patterns poorly. This paper addresses these limitations by proposing a comprehensive scheme for multimodal Closed-Circuit Television (CCTV) video...
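As a rough illustration of the attention-based multimodal fusion the title refers to, the sketch below computes softmax attention weights over per-modality feature vectors and returns their weighted sum. This is not the paper's implementation: the norm-based scoring function is a stand-in assumption (a real model would use a learned scorer), and the `video`/`audio` vectors are invented toy data.

```python
import numpy as np

def attention_fuse(modality_feats):
    """Fuse per-modality feature vectors with scalar attention weights.

    Scores come from a stand-in norm-based scorer (a real fusion network
    would learn this scoring function from data).
    """
    scores = np.array([np.linalg.norm(f) for f in modality_feats])
    # Softmax over modality scores -> attention weights summing to 1.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # Weighted sum emphasizes the modality with the larger score.
    stacked = np.stack(modality_feats)           # shape (M, D)
    return (w[:, None] * stacked).sum(axis=0)    # shape (D,)

# Hypothetical toy features for two modalities.
video = np.array([1.0, 0.0, 2.0])
audio = np.array([0.5, 0.5, 0.5])
fused = attention_fuse([video, audio])
```

Because the weights form a convex combination, the fused vector always lies elementwise between the modality features, which is one way to sanity-check a fusion layer of this shape.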
Saved in:
| Main Authors: | V. Srilakshmi, Sai Babu Veesam, Mallu Shiva Rama Krishna, Ravi Kumar Munaganuri, Dulam Devee Sivaprasad |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Anomaly detection; deep learning; multimodal fusion; temporal context modeling; unsupervised learning |
| Online Access: | https://ieeexplore.ieee.org/document/10876563/ |
| _version_ | 1849775150918008832 |
|---|---|
| author | V. Srilakshmi; Sai Babu Veesam; Mallu Shiva Rama Krishna; Ravi Kumar Munaganuri; Dulam Devee Sivaprasad |
| author_facet | V. Srilakshmi; Sai Babu Veesam; Mallu Shiva Rama Krishna; Ravi Kumar Munaganuri; Dulam Devee Sivaprasad |
| author_sort | V. Srilakshmi |
| collection | DOAJ |
| description | Traditional video-analysis approaches often mischaracterize anomalies: they typically rely on single-modality input and handle complex temporal patterns poorly. This paper addresses these limitations by proposing a comprehensive scheme for multimodal Closed-Circuit Television (CCTV) video analysis. The techniques employed comprise the Multimodal Deep Boltzmann Machine (MDBM), the Multimodal Variational Autoencoder (MVAE) and attention-based fusion networks, all of which fully exploit the learned representations: MDBM learns shared representations from heterogeneous data sources, MVAE captures the inherent distribution across modalities, and the attention mechanism in the fusion networks emphasizes the most informative features. Finally, temporal context is modeled with Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCNs) and Transformer networks with temporal encoding. LSTM captures long-range dependencies in sequential data, TCN efficiently models temporal patterns using convolutional layers, and Transformer networks weigh the relative importance of temporal features against one another through self-attention, improving detection accuracy for anomalies that unfold over long durations. The proposed models deliver substantial improvements in anomaly-detection performance: accuracy improved by 5% with MDBM, the false-positive rate fell by 15% with MVAE, the F1-score rose by more than 10% with the attentive fusion network, reconstruction error dropped by 20% with the Deep Convolutional Autoencoder (DCA), detection precision improved by 12% with Adversarially Learned Inference (ALI), and the Area Under the Curve (AUC) gained 8% with Deep InfoMax (DIM). |
| format | Article |
| id | doaj-art-a738371587e642d080f9bda39a4bf619 |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | Record doaj-art-a738371587e642d080f9bda39a4bf619, indexed 2025-08-20T03:01:31Z. English. IEEE, IEEE Access, ISSN 2169-3536, published 2025-01-01, vol. 13, pp. 27287–27309, DOI 10.1109/ACCESS.2025.3536501, IEEE document 10876563. Title: Design of an Improved Model for Anomaly Detection in CCTV Systems Using Multimodal Fusion and Attention-Based Networks. Authors: V. Srilakshmi (https://orcid.org/0000-0002-2058-0781), Sai Babu Veesam (https://orcid.org/0009-0000-5473-4681), Mallu Shiva Rama Krishna (https://orcid.org/0009-0007-8950-0288), Ravi Kumar Munaganuri (https://orcid.org/0000-0001-6629-2315), Dulam Devee Sivaprasad; all at School of Computer Science and Engineering, VIT-AP University, Amaravati, India. Abstract as given in the description field above. Online access: https://ieeexplore.ieee.org/document/10876563/. Keywords: Anomaly detection; deep learning; multimodal fusion; temporal context modeling; unsupervised learning |
| title | Design of an Improved Model for Anomaly Detection in CCTV Systems Using Multimodal Fusion and Attention-Based Networks |
| title_sort | design of an improved model for anomaly detection in cctv systems using multimodal fusion and attention based networks |
| topic | Anomaly detection; deep learning; multimodal fusion; temporal context modeling; unsupervised learning |
| url | https://ieeexplore.ieee.org/document/10876563/ |
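The description field notes that a TCN "efficiently models temporal patterns using convolutional layers" while seeing only past frames. As a rough illustration (not the paper's implementation), the building block a TCN stacks — a dilated causal 1-D convolution — can be sketched in plain NumPy; the input sequence and kernel here are invented toy values.

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Dilated causal 1-D convolution: output at time t sees only x[<= t].

    Left-padding with zeros guarantees no future leakage; stacking such
    layers with growing dilation gives a TCN its long receptive field.
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so t=0 is valid
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            # Tap i reaches i*dilation steps into the past.
            out[t] += kernel[i] * xp[pad + t - i * dilation]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
out = causal_conv1d(x, np.array([1.0, 1.0]))  # out[t] = x[t] + x[t-1]
# → [1., 3., 5., 7.]
```

With `dilation=2` the same kernel would instead sum `x[t]` and `x[t-2]`, which is how deeper TCN layers cover long durations without extra parameters.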