Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation

Abstract: Source-Free Video Unsupervised Domain Adaptation (SFVUDA) is a challenging video action recognition problem in which the adaptation procedure has no access to source domain data. To tackle this, we propose the Temporal Attention-based Vision Transformer for SFVUDA (TAViT-SFVUDA), which leverages temporal consistency and confidence-aware learning. Our approach enables domain-invariant representation learning by enforcing both global temporal consistency between individual clips and the entire video, and local temporal consistency within the clips of a single video. By prioritizing high-confidence local features, the model suppresses noise in the target domain while aligning effectively with the source data distribution. TAViT-SFVUDA's design rests on three essential components: (1) domain-invariant representation learning, which guarantees reliable feature extraction; (2) time-dependent feature alignment, which captures temporal dynamics across clips; and (3) pseudo-label generation with confidence filtering, which produces high-quality labels to guide adaptation. The model performs self-supervised temporal feature extraction and domain alignment without requiring source domain data or target labels. Comprehensive experiments on benchmark datasets, including ARID, Sports1M, HMDB51, and UCF101, show that TAViT-SFVUDA outperforms state-of-the-art Video Unsupervised Domain Adaptation (VUDA) and SFVUDA techniques. Our method provides a strong foundation for domain adaptation in video action recognition, underscoring its potential for practical use in settings where source data availability is limited.
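
The abstract describes its components only at a high level. As a rough illustration of two of the mechanisms it names (confidence-filtered pseudo-labeling and local/global temporal consistency), here is a minimal PyTorch-style sketch. Everything in it is an assumption: the threshold `tau`, the loss weight `lambda_tc`, and the model interface (per-clip features, a whole-video feature, and classifier logits) are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of two mechanisms named in the abstract:
# confidence-filtered pseudo-labels and local/global temporal consistency.
import torch
import torch.nn.functional as F


def confidence_filtered_pseudo_labels(logits: torch.Tensor, tau: float = 0.9):
    """Keep only predictions whose softmax confidence exceeds tau.

    logits: (batch, num_classes) classifier outputs for target videos.
    Returns argmax pseudo-labels and a boolean mask of confident samples.
    """
    probs = F.softmax(logits, dim=-1)
    conf, pseudo = probs.max(dim=-1)   # per-sample confidence and hard label
    return pseudo, conf >= tau         # the mask implements the confidence filter


def temporal_consistency_loss(clip_feats: torch.Tensor, video_feat: torch.Tensor):
    """Local + global temporal consistency, as sketched from the abstract.

    clip_feats: (batch, num_clips, dim) per-clip embeddings of each video.
    video_feat: (batch, dim) embedding of the whole video.
    """
    # Local term: consecutive clips of the same video should agree.
    local = (1 - F.cosine_similarity(clip_feats[:, :-1], clip_feats[:, 1:], dim=-1)).mean()
    # Global term: every clip should agree with the whole-video embedding.
    global_ = (1 - F.cosine_similarity(clip_feats, video_feat.unsqueeze(1), dim=-1)).mean()
    return local + global_


def adaptation_step(model, clips, optimizer, tau=0.9, lambda_tc=1.0):
    """One source-free adaptation step on a batch of unlabeled target videos.

    `model` is assumed (hypothetically) to map clips of shape
    (batch, num_clips, C, T, H, W) to (clip_feats, video_feat, logits).
    """
    clip_feats, video_feat, logits = model(clips)
    loss = lambda_tc * temporal_consistency_loss(clip_feats, video_feat)
    pseudo, mask = confidence_filtered_pseudo_labels(logits, tau)
    if mask.any():  # self-train only on confident pseudo-labels
        loss = loss + F.cross_entropy(logits[mask], pseudo[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A fixed threshold is the simplest form of confidence filtering; the paper may use a scheduled or class-balanced variant, which this sketch does not attempt to reproduce.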

Bibliographic Details
Main Authors: Shaimaa Yosry, Lamiaa Elrefaei, Rafaat ElKamaar, Rania R. Ziedan
Affiliation (all authors): Electrical Engineering Department, Faculty of Engineering at Shoubra, Benha University
Format: Article
Language: English
Published: Springer, 2025-07-01
Series: Discover Applied Sciences
Subjects: Source-free; Unsupervised domain adaptation; Domain-invariant representation; Time-dependent feature alignment; Confidence filtering
ISSN: 3004-9261
Online Access: https://doi.org/10.1007/s42452-025-06909-2
Collection: DOAJ
Institution: Kabale University
Record ID: doaj-art-03e511a14c21423887c4a32fc3c02c8d