Cross-modal learning with multi-modal model for video action recognition based on adaptive weight training

Canonical video action recognition methods usually label categories with numbers or one-hot vectors and train neural networks to classify a fixed set of predefined categories, which constrains their ability to recognise complex actions and to transfer to unseen concepts. In contrast, cross-modal learning can improve the performance of the individual modalities. Based on the observation that a better action recogniser can be built by reading the sentences used to describe actions, we exploit the recent multimodal foundation model CLIP for action recognition. In this study, an effective vision-language adaptation for action recognition is implemented from few-shot examples spanning different modalities. We add semantic information to action categories by treating textual and visual labels as training examples for constructing the action classifier, rather than simply labelling categories with numbers. Because words in a sentence and frames in a video differ in importance, simply averaging all sequential features may ignore keywords or key frames. To capture sequential and hierarchical representations, a weighted token-wise interaction mechanism is employed to exploit the pair-wise correlations adaptively. Extensive experiments on public datasets show that cross-modal learning helps downstream action classification; in other words, the proposed method trains better action classifiers by reading the sentences that describe the actions themselves. The proposed method not only achieves good generalisation and zero-shot/few-shot transfer on out-of-distribution (OOD) test sets, but also incurs lower computational complexity thanks to the lightweight interaction mechanism, reaching 84.15% Top-1 accuracy on Kinetics-400.
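
To make the abstract's weighted token-wise interaction concrete, here is a minimal PyTorch sketch. It assumes per-frame visual features and per-word text features have already been extracted (random tensors stand in for them); the module name WeightedTokenInteraction, the linear scoring heads, the max-over-tokens reduction, and the 512-dimensional features are hypothetical reconstructions in the spirit of the abstract, not the authors' released code.

# A minimal sketch, assuming pre-extracted CLIP-style features. The module
# name, the linear weighting heads, and the max-over-tokens reduction are
# hypothetical reconstructions of the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedTokenInteraction(nn.Module):
    """Adaptively weighted token-wise similarity between a video and a sentence."""

    def __init__(self, dim: int):
        super().__init__()
        # Small scoring heads that learn how important each frame/word token is.
        self.frame_score = nn.Linear(dim, 1)
        self.word_score = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # frames: (n_frames, dim) per-frame visual features for one video
        # words:  (n_words, dim) per-token text features for one class sentence
        frames = F.normalize(frames, dim=-1)
        words = F.normalize(words, dim=-1)
        sim = frames @ words.T  # (n_frames, n_words) pairwise cosine similarities

        # Learned weights replace uniform averaging, so key frames and
        # keywords dominate the video-text score instead of being diluted.
        w_frame = self.frame_score(frames).softmax(dim=0).squeeze(-1)  # (n_frames,)
        w_word = self.word_score(words).softmax(dim=0).squeeze(-1)     # (n_words,)

        frame_to_text = (sim.max(dim=1).values * w_frame).sum()  # each frame matches its best word
        text_to_frame = (sim.max(dim=0).values * w_word).sum()   # each word matches its best frame
        return 0.5 * (frame_to_text + text_to_frame)

# Classification scores a video against one sentence per action class and
# takes the argmax, in the spirit of CLIP's zero-shot classifier
# (400 classes, matching Kinetics-400):
interaction = WeightedTokenInteraction(dim=512)
video = torch.randn(8, 512)                                   # 8 frames, stand-in features
class_sentences = [torch.randn(12, 512) for _ in range(400)]  # one sentence per class
scores = torch.stack([interaction(video, s) for s in class_sentences])
predicted_class = scores.argmax().item()

The design point the sketch illustrates is the replacement of mean pooling with learned per-token weights: the pairwise similarity matrix is kept, and each side of the video-text pair contributes in proportion to how informative its tokens are.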

Bibliographic Details
Main Authors: Qingguo Zhou, Yufeng Hou, Rui Zhou, Yan Li, JinQiang Wang, Zhen Wu, Hung-Wei Li, Tien-Hsiung Weng
Format: Article
Language: English
Published: Taylor & Francis Group, 2024-12-01
Series: Connection Science
ISSN: 0954-0091, 1360-0494
Subjects: Adaptive weight training; cross-modal learning; video action recognition; vision-language adaptation
Online Access: https://www.tandfonline.com/doi/10.1080/09540091.2024.2325474
Author affiliations:
Qingguo Zhou, Yufeng Hou, Rui Zhou, Yan Li, JinQiang Wang, Zhen Wu: School of Information Science and Engineering, Lanzhou University, Lanzhou, People's Republic of China
Hung-Wei Li, Tien-Hsiung Weng: Department of Computer Science and Information Engineering, Providence University, Taichung City, Taiwan