Design of an Integrated Model for Video Summarization Using Multimodal Fusion and YOLO for Crime Scene Analysis


Bibliographic Details
Main Authors:	Sai Babu Veesam, Aravapalli Rama Satish
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Crime scene analysis; multimodal fusion; person detection; process; video summarization; weapon detection
Online Access:	https://ieeexplore.ieee.org/document/10870165/

author	Sai Babu Veesam; Aravapalli Rama Satish
collection	DOAJ
description	Video summarization for crime scene analysis is demanding because it requires accurate and efficient extraction of critical key events from multi-camera footage, including the identification of persons of interest and weapons in complex environments. Current approaches struggle with occlusions, cross-camera person re-identification (re-ID) and small-scale weapon detection, which leads to incomplete and inaccurate summaries. They are also not very robust to changing environments and incorporate little feedback for continuous improvement. To overcome these limitations, this paper presents a comprehensive system for video summarization through multimodal fusion and spatiotemporal analysis across multiple camera streams. The system integrates four components. AGMFN performs person detection and identity matching, using multi-head attention to fuse RGB frames, motion data and optical flow. YOLOv8 with Feature Pyramid Networks performs multi-scale weapon detection to capture small, partially occluded objects in cluttered scenes. 3D Convolutional Neural Networks combined with Temporal Attention Networks perform spatiotemporal action localization, capturing weapon-related actions and selecting the most critical frames. Finally, a feedback-driven reinforcement learning framework named RL-HITL enables continuous improvement based on human input, which enhances the adaptability of the system over time. The integrated system reports accuracy of 95-98% for person detection, 92-95% for weapon detection and 88-91% for action localization, while reducing video length by 70-80%. Real-time learning through RL-HITL supports ongoing model refinement, offering long-term benefits for security and surveillance applications, including crime scene analysis across different scenarios.
format	Article
id	doaj-art-5bb4159e64204395a0ee080aea5f40ab
institution	Kabale University
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-5bb4159e64204395a0ee080aea5f40ab; 2025-02-12T00:02:25Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; 13; 25008; 25025; 10.1109/ACCESS.2025.3538282; 10870165; Design of an Integrated Model for Video Summarization Using Multimodal Fusion and YOLO for Crime Scene Analysis; Sai Babu Veesam (https://orcid.org/0009-0000-5473-4681), School of Computer Science and Engineering, VIT-AP University, Amaravati, India; Aravapalli Rama Satish (https://orcid.org/0000-0002-4323-8073), School of Computer Science and Engineering, VIT-AP University, Amaravati, India; https://ieeexplore.ieee.org/document/10870165/; Crime scene analysis; multimodal fusion; person detection; process; video summarization; weapon detection
title	Design of an Integrated Model for Video Summarization Using Multimodal Fusion and YOLO for Crime Scene Analysis
topic	Crime scene analysis; multimodal fusion; person detection; process; video summarization; weapon detection
url	https://ieeexplore.ieee.org/document/10870165/

Design of an Integrated Model for Video Summarization Using Multimodal Fusion and YOLO for Crime Scene Analysis
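
The description says that AGMFN uses multi-head attention to fuse RGB frames, motion data and optical flow, but this record does not give the architecture. The following is only a minimal sketch of generic scaled dot-product multi-head attention over one embedding per modality, not the authors' implementation. The function name `multi_head_fusion`, the embedding size, the random weights and the mean pooling at the end are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_fusion(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Fuse modality embeddings with multi-head self-attention (hypothetical helper).

    tokens: (M, D) array, one D-dimensional embedding per modality.
    Returns the fused (D,) vector and the (H, M, M) attention weights.
    """
    m, d = tokens.shape
    assert d % num_heads == 0, "embedding size must be divisible by num_heads"
    d_head = d // num_heads

    def split(x):
        # (M, D) -> (H, M, d_head)
        return x.reshape(m, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, M, M)
    attn = softmax(scores, axis=-1)                      # each row sums to 1
    heads = attn @ v                                     # (H, M, d_head)
    merged = heads.transpose(1, 0, 2).reshape(m, d) @ w_o  # (M, D)
    # Assumption: average the attended modality tokens into one fused vector.
    return merged.mean(axis=0), attn


rng = np.random.default_rng(0)
D, H = 16, 4  # assumed embedding size and head count
# Stand-ins for per-frame RGB, motion and optical-flow embeddings.
rgb, motion, flow = (rng.standard_normal(D) for _ in range(3))
tokens = np.stack([rgb, motion, flow])
# Random projections in place of learned weights.
w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

fused, attn = multi_head_fusion(tokens, w_q, w_k, w_v, w_o, num_heads=H)
print(fused.shape, attn.shape)
```

In a trained model the projection matrices would be learned and the modality embeddings would come from feature extractors. Here they are random so that the sketch stays self-contained.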

Similar Items