Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation
Abstract Source-Free Video Unsupervised Domain Adaptation (SFVUDA) is a challenging video action recognition problem in which the adaptation procedure has no access to source domain data. To tackle this, we propose the Temporal Attention-based Vision Transformer for SFVUDA (TAViT-SFVUDA), which leverages confidence-aware learning and temporal consistency.
| Main Authors: | Shaimaa Yosry, Lamiaa Elrefaei, Rafaat ElKamaar, Rania R. Ziedan |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-07-01 |
| Series: | Discover Applied Sciences |
| Subjects: | Source-free; Unsupervised domain adaptation; Domain-invariant representation; Time-dependent feature alignment; Confidence filtering |
| Online Access: | https://doi.org/10.1007/s42452-025-06909-2 |
| _version_ | 1849238271718064128 |
|---|---|
| author | Shaimaa Yosry; Lamiaa Elrefaei; Rafaat ElKamaar; Rania R. Ziedan |
| author_facet | Shaimaa Yosry; Lamiaa Elrefaei; Rafaat ElKamaar; Rania R. Ziedan |
| author_sort | Shaimaa Yosry |
| collection | DOAJ |
| description | Abstract Source-Free Video Unsupervised Domain Adaptation (SFVUDA) is a challenging video action recognition problem in which the adaptation procedure has no access to source domain data. To tackle this, we propose the Temporal Attention-based Vision Transformer for SFVUDA (TAViT-SFVUDA), which leverages confidence-aware learning and temporal consistency. Our approach enables domain-invariant representation learning by enforcing both global temporal consistency, between individual clips and the entire video, and local temporal consistency within the clips of a single video. By prioritizing high-confidence local features, the model suppresses noise in the target domain while aligning effectively with the source data distribution. TAViT-SFVUDA's design rests on three essential components: (1) domain-invariant representation learning, which guarantees reliable feature extraction; (2) time-dependent feature alignment, which captures temporal dynamics across clips; and (3) pseudo-label generation with confidence filtering, which produces high-quality labels to guide adaptation (see the sketch after this record). The model performs self-supervised temporal feature extraction and domain alignment without requiring source domain data or target labels. Comprehensive experiments on benchmark datasets, including ARID, Sports1M, HMDB51, and UCF101, show that TAViT-SFVUDA outperforms state-of-the-art Video Unsupervised Domain Adaptation (VUDA) and SFVUDA techniques. Our method provides a strong foundation for domain adaptation in video action recognition, highlighting its potential for practical use in scenarios where source data availability is limited. |
| format | Article |
| id | doaj-art-03e511a14c21423887c4a32fc3c02c8d |
| institution | Kabale University |
| issn | 3004-9261 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Springer |
| record_format | Article |
| series | Discover Applied Sciences |
| spelling | doaj-art-03e511a14c21423887c4a32fc3c02c8d · 2025-08-20T04:01:41Z · eng · Springer · Discover Applied Sciences · 3004-9261 · 2025-07-01 · 7 · 7 · 123 · 10.1007/s42452-025-06909-2 · Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation · Shaimaa Yosry; Lamiaa Elrefaei; Rafaat ElKamaar; Rania R. Ziedan (Electrical Engineering Department, Faculty of Engineering at Shoubra, Benha University) · https://doi.org/10.1007/s42452-025-06909-2 · Source-free; Unsupervised domain adaptation; Domain-invariant representation; Time-dependent feature alignment; Confidence filtering |
| spellingShingle | Shaimaa Yosry; Lamiaa Elrefaei; Rafaat ElKamaar; Rania R. Ziedan · Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation · Discover Applied Sciences · Source-free; Unsupervised domain adaptation; Domain-invariant representation; Time-dependent feature alignment; Confidence filtering |
| title | Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation |
| title_full | Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation |
| title_fullStr | Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation |
| title_full_unstemmed | Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation |
| title_short | Temporal Attention-based Vision Transformer for Source-Free Video Unsupervised Domain Adaptation |
| title_sort | temporal attention based vision transformer for source free video unsupervised domain adaptation |
| topic | Source-free; Unsupervised domain adaptation; Domain-invariant representation; Time-dependent feature alignment; Confidence filtering |
| url | https://doi.org/10.1007/s42452-025-06909-2 |
| work_keys_str_mv | AT shaimaayosry temporalattentionbasedvisiontransformerforsourcefreevideounsuperviseddomainadaptation AT lamiaaelrefaei temporalattentionbasedvisiontransformerforsourcefreevideounsuperviseddomainadaptation AT rafaatelkamaar temporalattentionbasedvisiontransformerforsourcefreevideounsuperviseddomainadaptation AT raniarziedan temporalattentionbasedvisiontransformerforsourcefreevideounsuperviseddomainadaptation |
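The abstract names two concrete mechanisms: local/global temporal consistency between clip-level and video-level features, and pseudo-labels gated by a confidence filter. Below is a minimal sketch of how such objectives are commonly written, assuming PyTorch; the function name `adaptation_losses`, the mean-pooled video feature, and the 0.9 threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumed PyTorch) of the adaptation objectives the
# abstract names: local/global temporal consistency and confidence-
# filtered pseudo-labelling. Names and thresholds are hypothetical,
# not taken from the published TAViT-SFVUDA implementation.
import torch
import torch.nn.functional as F

def adaptation_losses(clip_feats: torch.Tensor,
                      classifier: torch.nn.Module,
                      conf_threshold: float = 0.9) -> torch.Tensor:
    """clip_feats: (B, T, D) clip embeddings for B target videos of T clips."""
    # Global temporal consistency: each clip embedding should agree with
    # the video-level embedding (here taken as the mean over clips).
    video_feat = clip_feats.mean(dim=1)                        # (B, D)
    global_loss = (1.0 - F.cosine_similarity(
        clip_feats, video_feat.unsqueeze(1), dim=-1)).mean()

    # Local temporal consistency: neighbouring clips within one video
    # should have similar embeddings.
    local_loss = (1.0 - F.cosine_similarity(
        clip_feats[:, :-1], clip_feats[:, 1:], dim=-1)).mean()

    # Confidence filtering: classify the video-level feature and keep
    # only predictions above the threshold as pseudo-labels.
    logits = classifier(video_feat)                            # (B, C)
    conf, pseudo = logits.softmax(dim=-1).max(dim=-1)
    mask = conf >= conf_threshold
    pseudo_loss = (F.cross_entropy(logits[mask], pseudo[mask])
                   if mask.any() else logits.new_zeros(()))

    return global_loss + local_loss + pseudo_loss
```

The threshold gate is the "confidence filtering" step: target predictions below it contribute no gradient, which is how this family of source-free methods limits pseudo-label noise during adaptation.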