Multimodal anomaly detection in complex environments using video and audio fusion
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-05-01 |
| Series: | Scientific Reports |
| Subjects: | |
| Online Access: | https://doi.org/10.1038/s41598-025-01146-4 |
| Summary: | Abstract: Due to complex environmental conditions and varying noise levels, traditional models are limited in their effectiveness for detecting anomalies in video sequences. To address the challenges of accuracy, robustness, and real-time processing in the field of image and video processing, this study proposes a deep-learning-based anomaly detection and recognition algorithm for video image data. The algorithm combines innovative spatio-temporal feature extraction with noise suppression, and introduces an improved Variational Auto-Encoder (VAE) structure to improve processing performance, especially in complex environments. The model, named the Spatio-Temporal Anomaly Detection Network (STADNet), captures the spatio-temporal features of video images through a multi-scale three-dimensional (3D) convolution module and a spatio-temporal attention mechanism, improving the accuracy of anomaly detection. A multi-stream network architecture and a cross-attention fusion mechanism are also adopted to jointly consider factors such as color, texture, and motion, further improving the robustness and generalization ability of the model. Experimental results show that, compared with existing models, the proposed model has clear advantages in performance stability and real-time processing under different noise levels. Specifically, it achieves an AUC of 0.95 on the UCSD Ped2 dataset, about 10% higher than other models, and an AUC of 0.93 on the Avenue dataset, about 12% higher. This study not only proposes an effective image and video processing scheme but also demonstrates broad practical potential, providing a new perspective and methodological basis for future research and applications in related fields. |
| ISSN: | 2045-2322 |
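The summary describes a multi-scale 3D convolution module for capturing spatio-temporal features from video clips. The record does not give the paper's actual kernel sizes or implementation, so the following is only a minimal numpy sketch of the general idea: convolving a (time, height, width) clip with kernels of two different sizes to obtain features at two receptive-field scales. All names and kernel sizes here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid'-mode 3D cross-correlation of a (T, H, W) clip
    with a (t, h, w) kernel. Slow, but dependency-free and explicit."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# Hypothetical 8-frame, 16x16 grayscale clip
clip = np.random.default_rng(1).standard_normal((8, 16, 16))

# Multi-scale branches: averaging kernels with small and large
# spatio-temporal receptive fields (stand-ins for learned filters)
feat_small = conv3d_valid(clip, np.ones((3, 3, 3)) / 27)
feat_large = conv3d_valid(clip, np.ones((5, 5, 5)) / 125)
print(feat_small.shape, feat_large.shape)  # (6, 14, 14) (4, 12, 12)
```

In a real network these branches would use learned filters (e.g. `torch.nn.Conv3d`) and their outputs would be aligned and concatenated; the sketch only illustrates how different kernel sizes yield features over different spatio-temporal extents.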
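The summary also mentions a cross-attention fusion mechanism for combining multi-stream features (color, texture, motion). The paper's exact formulation is not given in this record; as a hedged illustration, the sketch below implements generic scaled dot-product cross-attention in numpy, where tokens from one hypothetical stream attend to tokens from another. Stream names and shapes are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_stream, key_value_stream):
    """Generic scaled dot-product cross-attention: each token of
    query_stream attends over all tokens of key_value_stream.
    Both inputs have shape (tokens, dim)."""
    d = query_stream.shape[-1]
    scores = query_stream @ key_value_stream.T / np.sqrt(d)  # (Tq, Tk)
    weights = softmax(scores, axis=-1)                       # rows sum to 1
    return weights @ key_value_stream                        # (Tq, dim)

rng = np.random.default_rng(0)
appearance = rng.standard_normal((8, 16))  # hypothetical color/texture tokens
motion = rng.standard_normal((8, 16))      # hypothetical motion tokens
fused = cross_attention(appearance, motion)
print(fused.shape)  # (8, 16)
```

A full model would add learned query/key/value projections and typically fuse in both directions (appearance attending to motion and vice versa); this stripped-down version shows only the attention arithmetic itself.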