Design of an Integrated Model for Video Summarization Using Multimodal Fusion and YOLO for Crime Scene Analysis


Bibliographic Details
Main Authors: Sai Babu Veesam, Aravapalli Rama Satish
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10870165/
Description
Summary:Crime scene analysis in video summarization is particularly demanding: it requires the accurate and efficient extraction of critical events from multi-camera footage, including the identification of persons of interest, weapons, and complex environments. Current approaches suffer from occlusions, cross-camera person re-identification (re-ID) failures, and small-scale weapon detection, leading to incomplete and inaccurate summaries. Moreover, these methods are not robust to changing environments and incorporate little feedback for continuous improvement. To overcome these limitations, this paper presents a comprehensive system for video summarization based on multimodal fusion and spatiotemporal analysis across multiple camera streams. The system integrates the following advanced technologies: AGMFN for person detection and identity matching, which employs multi-head attention to fuse RGB frames, motion data, and optical flow; YOLOv8 with Feature Pyramid Networks for multi-scale weapon detection, capturing small, partially occluded objects in cluttered scenes; and spatio-temporal action localization via 3D Convolutional Neural Networks combined with Temporal Attention Networks, which capture weapon-related actions through an optimal set of critical frames. Finally, a feedback-driven reinforcement learning framework named RL-HITL enables continuous improvement from human input, enhancing the system's adaptability over time. The integrated system achieves person-detection accuracy of 95-98%, weapon-detection accuracy of 92-95%, and action-localization accuracy of 88-91%, while reducing video length by 70-80%. Real-time learning through RL-HITL ensures ongoing model refinement, giving long-term benefits in security and surveillance applications and supporting crime scene analysis across diverse scenarios.
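The abstract describes fusing RGB, motion, and optical-flow features with multi-head attention. As a rough illustration of that fusion pattern (not the paper's actual AGMFN architecture, whose weights and layer layout are not given here), the sketch below treats the three modality feature vectors as a three-token sequence and applies standard multi-head self-attention with random stand-in projection matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_fusion(rgb, motion, flow, num_heads=4, seed=0):
    """Fuse three modality feature vectors via multi-head attention.

    Illustrative only: the projection matrices are random stand-ins for
    learned parameters, and the final fused vector is a mean over the
    three attended modality tokens.
    """
    d = rgb.shape[-1]
    assert d % num_heads == 0, "feature dim must divide evenly across heads"
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    # Stack modalities as a length-3 "token" sequence: (3, d).
    tokens = np.stack([rgb, motion, flow])
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    heads = []
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        # Scaled dot-product attention over the 3 modality tokens.
        attn = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))   # (3, 3)
        heads.append(attn @ V[:, s])                         # (3, dh)
    # Concatenate heads, then pool across modalities: (d,).
    return np.concatenate(heads, axis=-1).mean(axis=0)
```

In a trained model the projections would be learned jointly with the downstream person re-ID objective; here they merely show how the three streams interact through a shared attention map.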
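The reported 70-80% reduction in video length corresponds to keeping roughly the top 20-30% of frames ranked by importance. A minimal sketch of that selection step, assuming per-frame attention scores are already available from a temporal attention module (the scoring model itself is not specified here):

```python
import numpy as np

def select_keyframes(scores, keep_ratio=0.25):
    """Keep the highest-scoring frames, preserving temporal order.

    scores     : 1-D array of per-frame importance scores (assumed given).
    keep_ratio : fraction of frames to retain; 0.2-0.3 matches the
                 70-80% length reduction reported in the abstract.
    Returns sorted frame indices forming the summary.
    """
    n_keep = max(1, int(len(scores) * keep_ratio))
    top = np.argsort(scores)[-n_keep:]   # indices of the n_keep best frames
    return np.sort(top)                  # restore chronological order
```

The retained indices can then be used to slice the original frame sequence into the summarized clip.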
ISSN:2169-3536