Training-Free VLM-Based Pseudo Label Generation for Video Anomaly Detection
Video anomaly detection in weakly supervised settings remains a challenging task due to the absence of frame-level annotations. To address this, we propose a novel training-free pseudo-label generation module (TFPLG) for Weakly Supervised Video Anomaly Detection (WSVAD), which leverages the vision-language alignment of the pre-trained CLIP model to generate pseudo-labels without requiring any training. Unlike prior methods that depend on learned classifiers, our approach employs a threshold-guided similarity-matching mechanism to produce both fine-grained and coarse-grained pseudo-labels. The framework adopts a triple-branch architecture: the first branch generates pseudo-labels, while the second and third perform coarse-grained binary and fine-grained categorical classification. Temporal modeling is enhanced through the integration of transformers and Graph Convolutional Networks (GCNs) to capture both short- and long-range dependencies. Experiments on UCF-Crime and XD-Violence demonstrate the effectiveness of our approach, achieving a 1.4% average precision gain on XD-Violence compared to leading pseudo-labeling methods, and a 1.6% improvement in anomaly AUC on UCF-Crime over the best existing approaches. In zero-shot testing on the new MSAD dataset, our framework achieves a 3.24% AUC improvement, highlighting its robustness and adaptability. The source code is publicly available at: https://github.com/MoshiraAbdalla/TFPLG_VAD
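
As a rough sketch of the threshold-guided similarity matching described in the abstract, the snippet below scores video frames against anomaly-class text prompts with a pre-trained CLIP model and thresholds the result into coarse (normal/anomalous) and fine (per-class) pseudo-labels. The prompt wording, the 0.5 threshold, the function name `generate_pseudo_labels`, and the use of the Hugging Face `openai/clip-vit-base-patch32` checkpoint are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Hypothetical anomaly vocabulary; the paper's actual prompt set is not specified in this record.
PROMPTS = ["a normal scene", "a fighting scene", "a road accident",
           "an explosion", "a robbery", "an act of vandalism"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def generate_pseudo_labels(frames, threshold=0.5):
    """Training-free pseudo-labels for a list of PIL video frames.

    `threshold` is an assumed cut-off on the anomaly probability, not the paper's value.
    """
    inputs = processor(text=PROMPTS, images=frames, return_tensors="pt", padding=True)
    # logits_per_image: (num_frames, num_prompts) scaled image-text cosine similarities
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

    anomaly_prob = 1.0 - probs[:, 0]            # prompt 0 ("a normal scene") plays the normal class
    coarse = (anomaly_prob > threshold).long()  # 0 = normal, 1 = anomalous
    fine = probs.argmax(dim=-1)                 # most similar prompt = fine-grained category
    fine[coarse == 0] = 0                       # keep fine labels consistent with the coarse decision
    return coarse, fine, anomaly_prob
```

Because nothing in this routine is trained, it can be run once over each weakly labeled training video to bootstrap the frame-level supervision consumed by the coarse and fine classification branches.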
| Main Authors: | Moshira Abdalla, Sajid Javed |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Vision language models; weakly supervised learning; video anomaly detection |
| Online Access: | https://ieeexplore.ieee.org/document/11015429/ |
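
The abstract also mentions transformer and GCN layers for temporal modeling over frame features. The block below is a minimal sketch of that idea, assuming 512-dimensional CLIP frame embeddings and a soft adjacency built from pairwise cosine similarity between frames; the class name `TemporalBlock`, the layer sizes, the graph construction, and the branch wiring are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBlock(nn.Module):
    """Sketch: a small transformer encoder for long-range context plus one similarity-graph GCN layer."""

    def __init__(self, dim=512, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.gcn = nn.Linear(dim, dim)  # single GCN layer: A_hat @ X @ W

    def forward(self, x):
        # x: (batch, num_frames, dim) frame features, e.g. CLIP embeddings
        x = self.encoder(x)
        # Soft adjacency from pairwise cosine similarity between frames (an assumed design choice).
        sim = F.cosine_similarity(x.unsqueeze(2), x.unsqueeze(1), dim=-1)
        adj = F.softmax(sim, dim=-1)    # row-normalised adjacency matrix
        return F.relu(adj @ self.gcn(x))

# Toy usage: two clips of 64 frames with 512-d features, scored frame by frame.
features = torch.randn(2, 64, 512)
scores = torch.sigmoid(nn.Linear(512, 1)(TemporalBlock()(features))).squeeze(-1)  # (2, 64) anomaly scores
```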
| author | Moshira Abdalla; Sajid Javed |
|---|---|
| collection | DOAJ |
| description | Video anomaly detection in weakly supervised settings remains a challenging task due to the absence of frame-level annotations. To address this, we propose a novel training-free pseudo-label generation module (TFPLG) for Weakly Supervised Video Anomaly Detection (WSVAD), which leverages the vision-language alignment of the pre-trained CLIP model to generate pseudo-labels without requiring any training. Unlike prior methods that depend on learned classifiers, our approach employs a threshold-guided similarity-matching mechanism to produce both fine-grained and coarse-grained pseudo-labels. The framework adopts a triple-branch architecture: the first branch generates pseudo-labels, while the second and third perform coarse-grained binary and fine-grained categorical classification. Temporal modeling is enhanced through the integration of transformers and Graph Convolutional Networks (GCNs) to capture both short- and long-range dependencies. Experiments on UCF-Crime and XD-Violence demonstrate the effectiveness of our approach, achieving a 1.4% average precision gain on XD-Violence compared to leading pseudo-labeling methods, and a 1.6% improvement in anomaly AUC on UCF-Crime over the best existing approaches. In zero-shot testing on the new MSAD dataset, our framework achieves a 3.24% AUC improvement, highlighting its robustness and adaptability. The source code is publicly available at: https://github.com/MoshiraAbdalla/TFPLG_VAD |
| format | Article |
| id | doaj-art-1dd0ea30f2fa4ae5a7a22d67042a0e52 |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-1dd0ea30f2fa4ae5a7a22d67042a0e52 (indexed 2025-08-20T03:06:04Z); IEEE Access, vol. 13, pp. 92155-92167, published 2025-01-01; DOI: 10.1109/ACCESS.2025.3573594; IEEE Xplore article 11015429; authors: Moshira Abdalla (ORCID: 0009-0008-2608-5352) and Sajid Javed (ORCID: 0000-0002-0036-2875), Department of Computer Science, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates |
| title | Training-Free VLM-Based Pseudo Label Generation for Video Anomaly Detection |
| topic | Vision language models; weakly supervised learning; video anomaly detection |
| url | https://ieeexplore.ieee.org/document/11015429/ |