Understanding Video Transformers: A Review on Key Strategies for Feature Learning and Performance Optimization

Bibliographic Details
Main Authors: Nan Chen, Tie Xu, Mingrui Sun, Chenggui Yao, Dongping Yang
Format: Article
Language: English
Published: American Association for the Advancement of Science (AAAS) 2025-01-01
Series: Intelligent Computing
Online Access: https://spj.science.org/doi/10.34133/icomputing.0143
author Nan Chen
Tie Xu
Mingrui Sun
Chenggui Yao
Dongping Yang
collection DOAJ
description The video transformer model, a deep learning tool relying on the self-attention mechanism, is capable of efficiently capturing and processing spatiotemporal information in videos through effective spatiotemporal modeling, thereby enabling deep analysis and precise understanding of video content. It has become a focal point of academic attention. This paper first reviews the classic model architectures and notable achievements of the transformer in the domains of natural language processing (NLP) and image processing. It then explores performance enhancement strategies and video feature learning methods for the video transformer, considering 4 key dimensions: input module optimization, internal structure innovation, overall framework design, and hybrid model construction. Finally, it summarizes the latest advancements of the video transformer in cutting-edge application areas such as video classification, action recognition, video object detection, and video object segmentation. A comprehensive outlook on the future research trends and potential challenges of the video transformer is also provided as a reference for subsequent studies.
format Article
id doaj-art-7ea6cfaf50e84ba1970d39159fea3d06
institution OA Journals
issn 2771-5892
language English
publishDate 2025-01-01
publisher American Association for the Advancement of Science (AAAS)
record_format Article
series Intelligent Computing
spelling doaj-art-7ea6cfaf50e84ba1970d39159fea3d06
2025-08-20T02:36:45Z
eng
American Association for the Advancement of Science (AAAS)
Intelligent Computing
2771-5892
2025-01-01
4
10.34133/icomputing.0143
Understanding Video Transformers: A Review on Key Strategies for Feature Learning and Performance Optimization
Nan Chen (0)
Tie Xu (1)
Mingrui Sun (2)
Chenggui Yao (3)
Dongping Yang (4)
(0) School of Mathematical Sciences, Zhejiang Normal University, Jinhua, Zhejiang, China.
(1) Research Centre for Frontier Fundamental Studies, Zhejiang Lab, Hangzhou, Zhejiang, China.
(2) School of Data Science, Jiaxing University, Jiaxing, Zhejiang, China.
(3) School of Data Science, Jiaxing University, Jiaxing, Zhejiang, China.
(4) Research Centre for Frontier Fundamental Studies, Zhejiang Lab, Hangzhou, Zhejiang, China.
https://spj.science.org/doi/10.34133/icomputing.0143
title Understanding Video Transformers: A Review on Key Strategies for Feature Learning and Performance Optimization
url https://spj.science.org/doi/10.34133/icomputing.0143