Understanding Video Transformers: A Review on Key Strategies for Feature Learning and Performance Optimization
| Main Authors: | Nan Chen, Tie Xu, Mingrui Sun, Chenggui Yao, Dongping Yang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | American Association for the Advancement of Science (AAAS), 2025-01-01 |
| Series: | Intelligent Computing |
| Online Access: | https://spj.science.org/doi/10.34133/icomputing.0143 |


| Field | Value |
|---|---|
| author | Nan Chen; Tie Xu; Mingrui Sun; Chenggui Yao; Dongping Yang |
| collection | DOAJ |
| description | The video transformer model, a deep learning tool relying on the self-attention mechanism, is capable of efficiently capturing and processing spatiotemporal information in videos through effective spatiotemporal modeling, thereby enabling deep analysis and precise understanding of video content. It has become a focal point of academic attention. This paper first reviews the classic model architectures and notable achievements of the transformer in the domains of natural language processing (NLP) and image processing. It then explores performance enhancement strategies and video feature learning methods for the video transformer, considering 4 key dimensions: input module optimization, internal structure innovation, overall framework design, and hybrid model construction. Finally, it summarizes the latest advancements of the video transformer in cutting-edge application areas such as video classification, action recognition, video object detection, and video object segmentation. A comprehensive outlook on the future research trends and potential challenges of the video transformer is also provided as a reference for subsequent studies. |
| id | doaj-art-7ea6cfaf50e84ba1970d39159fea3d06 |
| institution | OA Journals |
| issn | 2771-5892 |
| volume | 4 |
| doi | 10.34133/icomputing.0143 |
| affiliations | Nan Chen: School of Mathematical Sciences, Zhejiang Normal University, Jinhua, Zhejiang, China; Tie Xu: Research Centre for Frontier Fundamental Studies, Zhejiang Lab, Hangzhou, Zhejiang, China; Mingrui Sun: School of Data Science, Jiaxing University, Jiaxing, Zhejiang, China; Chenggui Yao: School of Data Science, Jiaxing University, Jiaxing, Zhejiang, China; Dongping Yang: Research Centre for Frontier Fundamental Studies, Zhejiang Lab, Hangzhou, Zhejiang, China |