Semi-supervised action recognition using logit aligned consistency and adaptive negative learning



Bibliographic Details
Main Authors: Fengyun Zuo, Yang Xu, Minggang Wang
Format: Article
Language:English
Published: Nature Portfolio 2025-05-01
Series:Scientific Reports
Subjects:
Online Access:https://doi.org/10.1038/s41598-025-01922-2
Description
Summary: Abstract With the rise of the social video era, semi-supervised action recognition can address the increasingly high cost of video annotation, but it still faces significant challenges, particularly the underexplored application of Vision Transformers. In this paper, we present Full-SVFormer, a simple yet efficient semi-supervised action recognition architecture built on the Transformer framework. Full-SVFormer uses a pre-trained TimeSformer as its backbone network, balancing the accuracy and speed of Transformers in semi-supervised action recognition. Within the stable pseudo-label framework EMA-Teacher, we introduce a KL divergence loss, preceded by logit standardization preprocessing, as the unsupervised consistency loss. This sharpens the student's focus on the inherent relationship between the student's and the teacher's logits. Furthermore, we incorporate the Adaptive Negative Learning (ANL) method to introduce additional negative pseudo-labels: it dynamically evaluates the model's Top-k performance and adaptively assigns negative labels, making better use of ambiguous prediction examples. We conducted extensive experiments on two widely used datasets, UCF-101 and HMDB-51, where our overall results outperformed previous methods. Our work further advances Transformers in the domain of semi-supervised action recognition.
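The two losses described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: it shows (a) logit standardization followed by a KL consistency term between teacher and student predictions, and (b) a Top-k negative pseudo-label mask in the spirit of ANL. All function names are illustrative, and `k` is passed in directly, whereas the paper selects it adaptively from the model's Top-k performance.

```python
import numpy as np

def logit_standardize(z, eps=1e-8):
    # Z-score each sample's logits (zero mean, unit std), so the consistency
    # loss depends on the relative structure of the logits, not their scale.
    mu = z.mean(axis=-1, keepdims=True)
    sigma = z.std(axis=-1, keepdims=True)
    return (z - mu) / (sigma + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_consistency_loss(student_logits, teacher_logits, tau=1.0):
    # KL(teacher || student) over standardized logits, averaged over the batch.
    p = softmax(logit_standardize(teacher_logits) / tau)  # teacher target
    q = softmax(logit_standardize(student_logits) / tau)  # student prediction
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)),
                                axis=-1)))

def negative_label_mask(teacher_logits, k):
    # Classes outside the teacher's Top-k become negative pseudo-labels
    # ("this class is NOT the label"); returns True where a class is negative.
    top_k_idx = np.argsort(-teacher_logits, axis=-1)[:, :k]
    mask = np.ones_like(teacher_logits, dtype=bool)
    np.put_along_axis(mask, top_k_idx, False, axis=-1)
    return mask
```

For identical student and teacher logits the KL term vanishes, and the negative mask marks exactly the classes the teacher ranks below its Top-k, which is what lets ambiguous examples still contribute a training signal.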
ISSN:2045-2322