Training-Free VLM-Based Pseudo Label Generation for Video Anomaly Detection

Video anomaly detection in weakly supervised settings remains a challenging task due to the absence of frame-level annotations. To address this, we propose a novel training-free pseudo-label generation module (TFPLG) for Weakly Supervised Video Anomaly Detection (WSVAD), which leverages the vision-language alignment of the pre-trained CLIP model to generate pseudo-labels without requiring any training. Unlike prior methods that depend on learned classifiers, our approach employs a threshold-guided similarity-matching mechanism to produce both fine-grained and coarse-grained pseudo-labels. The framework adopts a triple-branch architecture: the first branch generates pseudo-labels, while the second and third perform coarse-grained binary and fine-grained categorical classification. Temporal modeling is enhanced through the integration of transformers and Graph Convolutional Networks (GCNs) to capture both short- and long-range dependencies. Experiments on UCF-Crime and XD-Violence demonstrate the effectiveness of our approach, achieving a 1.4% average precision gain on XD-Violence compared to leading pseudo-labeling methods, and a 1.6% improvement in anomaly AUC on UCF-Crime over the best existing approaches. In zero-shot testing on the new MSAD dataset, our framework achieves a 3.24% AUC improvement, highlighting its robustness and adaptability. The source code is publicly available at: https://github.com/MoshiraAbdalla/TFPLG_VAD

Bibliographic Details
Main Authors: Moshira Abdalla, Sajid Javed
Author Affiliation: Department of Computer Science, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Volume/Pages: Vol. 13, pp. 92155-92167
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3573594
Subjects: Vision language models; Weakly supervised learning; Video anomaly detection
Online Access: https://ieeexplore.ieee.org/document/11015429/
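
The abstract describes a threshold-guided similarity-matching mechanism that turns CLIP's vision-language alignment into frame-level pseudo-labels without any training. The sketch below illustrates one way such a step could look, assuming CLIP frame embeddings and class-prompt embeddings are already computed; the margin threshold `tau`, the `normal_idx` convention, and all function names are illustrative assumptions, not the authors' released implementation (see the repository linked in the abstract for the actual code).

```python
# Minimal sketch of threshold-guided similarity matching for pseudo-label
# generation. Thresholds, prompts, and names are assumptions for illustration,
# not the TFPLG authors' released implementation.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def generate_pseudo_labels(frame_feats, class_feats, normal_idx=0, tau=0.25):
    """
    frame_feats : (T, D) CLIP image embeddings, one per frame/snippet.
    class_feats : (C, D) CLIP text embeddings of class prompts; index
                  `normal_idx` is the 'normal activity' prompt.
    tau         : similarity-margin threshold (hypothetical value).

    Returns coarse (0 = normal, 1 = anomalous) and fine-grained (class index)
    pseudo-labels per frame, produced without any training.
    """
    frame_feats = l2_normalize(frame_feats)
    class_feats = l2_normalize(class_feats)

    # Cosine similarity between every frame and every class prompt.
    sim = frame_feats @ class_feats.T                      # (T, C)

    # Fine-grained label: best-matching class prompt per frame.
    fine = sim.argmax(axis=1)                              # (T,)

    # Coarse label: anomalous if some anomaly prompt beats the normal
    # prompt by more than the margin tau.
    anomaly_sim = np.delete(sim, normal_idx, axis=1).max(axis=1)
    coarse = (anomaly_sim - sim[:, normal_idx] > tau).astype(np.int64)

    # Frames judged normal keep the normal class as their fine label.
    fine[coarse == 0] = normal_idx
    return coarse, fine

if __name__ == "__main__":
    # Stand-in embeddings; in practice these would come from CLIP's image
    # and text encoders applied to video frames and class-name prompts.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(16, 512))
    classes = rng.normal(size=(14, 512))   # e.g. 'normal' + 13 anomaly classes
    coarse, fine = generate_pseudo_labels(frames, classes)
    print(coarse, fine)
```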