Design of an Integrated Model for Video Summarization Using Multimodal Fusion and YOLO for Crime Scene Analysis
Crime scene analysis is a demanding setting for video summarization: it requires accurate and efficient extraction of critical key events from multi-camera footage, including the identification of persons of interest and weapons in complicated environments. The current app...
Main Authors: | Sai Babu Veesam, Aravapalli Rama Satish |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2025-01-01 |
Series: | IEEE Access |
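Subjects: | Crime scene analysis; multimodal fusion; person detection; process; video summarization; weapon detection |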
Online Access: | https://ieeexplore.ieee.org/document/10870165/ |
_version_ | 1823857131571380224 |
---|---|
author | Sai Babu Veesam; Aravapalli Rama Satish |
author_sort | Sai Babu Veesam |
collection | DOAJ |
description | Crime scene analysis is a demanding setting for video summarization: it requires accurate and efficient extraction of critical key events from multi-camera footage, including the identification of persons of interest and weapons in complicated environments. Current approaches struggle with occlusions, cross-camera person re-identification and small-scale weapon detection, which leads to incomplete and inaccurate summaries. Moreover, these methods are not robust to changing environments and incorporate little feedback for continuous improvement. To overcome these limitations, this paper presents a comprehensive system for video summarization through multimodal fusion and spatiotemporal analysis across multiple camera streams. The system integrates several components. AGMFN performs person detection and identity matching, using multi-head attention to fuse RGB frames, motion data and optical flow. YOLOv8 with Feature Pyramid Networks performs multi-scale weapon detection to capture small, partially occluded objects in cluttered scenes. Spatio-temporal action localization uses 3D Convolutional Neural Networks together with Temporal Attention Networks to capture weapon-related actions with the best set of critical frames. Finally, a feedback-driven reinforcement learning framework named RL-HITL enables continuous improvement based on human input, which enhances the adaptability of the system over time. The integrated system reports accuracy of 95-98% for person detection, 92-95% for weapon detection and 88-91% for action localization, while reducing video length by 70-80%. Real-time learning through RL-HITL supports ongoing model refinement, giving long-term benefits for security and surveillance applications and for crime scene analysis across different scenarios. |
format | Article |
id | doaj-art-5bb4159e64204395a0ee080aea5f40ab |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-5bb4159e64204395a0ee080aea5f40ab; 2025-02-12T00:02:25Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 25008-25025; DOI 10.1109/ACCESS.2025.3538282; document 10870165; Design of an Integrated Model for Video Summarization Using Multimodal Fusion and YOLO for Crime Scene Analysis; Sai Babu Veesam (https://orcid.org/0009-0000-5473-4681), School of Computer Science and Engineering, VIT-AP University, Amaravati, India; Aravapalli Rama Satish (https://orcid.org/0000-0002-4323-8073), School of Computer Science and Engineering, VIT-AP University, Amaravati, India; https://ieeexplore.ieee.org/document/10870165/; Crime scene analysis; multimodal fusion; person detection; process; video summarization; weapon detection |
title | Design of an Integrated Model for Video Summarization Using Multimodal Fusion and YOLO for Crime Scene Analysis |
topic | Crime scene analysis; multimodal fusion; person detection; process; video summarization; weapon detection |
url | https://ieeexplore.ieee.org/document/10870165/ |
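
The description says that AGMFN uses multi-head attention to fuse RGB frames, motion data and optical flow, but this record gives no implementation details. The following is only a minimal sketch of the general idea of attention-based fusion, not the authors' method. It assumes each modality has already been encoded as a feature vector of the same length; the names `attend` and `multi_head_fuse`, the use of the mean slice as the query, and the absence of learned projection weights are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output component at a time.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def multi_head_fuse(modalities, num_heads):
    """Fuse equal-length modality vectors, one attention head per slice.

    Assumption: no learned projections; each head sees a contiguous slice
    of every modality and queries with the mean of those slices.
    """
    dim = len(modalities[0])
    if any(len(m) != dim for m in modalities):
        raise ValueError("all modality vectors must have the same length")
    if dim % num_heads != 0:
        raise ValueError("feature length must be divisible by num_heads")
    head_dim = dim // num_heads
    fused = []
    for head in range(num_heads):
        start, stop = head * head_dim, (head + 1) * head_dim
        slices = [m[start:stop] for m in modalities]
        query = [sum(s[i] for s in slices) / len(slices) for i in range(head_dim)]
        fused.extend(attend(query, slices, slices))
    return fused


# Hypothetical per-frame embeddings for the three modalities named above.
rgb = [0.9, 0.1, 0.4, 0.7]
motion = [0.2, 0.8, 0.5, 0.1]
optical_flow = [0.3, 0.6, 0.9, 0.2]
fused_features = multi_head_fuse([rgb, motion, optical_flow], num_heads=2)
```

Because every head outputs a softmax-weighted average, each fused component stays within the range spanned by the corresponding modality components. A trained model would instead learn separate query, key and value projections per head.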