Parameter-efficient weakly supervised referring video object segmentation via chain-of-thought reasoning

Bibliographic Details
Main Authors: Xing Wang, Zhe Xu, Yuanshi Zheng, Handing Wang
Format: Article
Language: English
Published: Springer, 2025-05-01
Series: Complex & Intelligent Systems
Online Access:https://doi.org/10.1007/s40747-025-01900-1
Description
Summary: Abstract Referring video object segmentation (RVOS) aims to segment the object in a video that corresponds to a language expression. Most existing RVOS methods are trained with accurate per-pixel annotations, which are expensive and time-consuming to obtain. Moreover, they must update all parameters of the segmentation model, which makes training inefficient as model scale increases. In this paper, we propose a novel parameter-efficient framework under weak supervision, dubbed ReferringAdapter, to ameliorate both issues. Specifically, we propose to adapt an off-the-shelf image segmentation model for RVOS by plugging a small set of trainable parameters, i.e., an adapter, into its intermediate layers. This efficiently endows a uni-modal image segmentation model with the cross-modal ability to segment the video object referred to by a language expression. To update the adapter parameters under weak supervision, instead of directly fusing the video and sentence-level language features, we propose chain-of-thought reasoning that considers the intermediate steps along the thought process. Extensive experiments demonstrate that training the adapter, which amounts to only 1.1% of the total parameters, outperforms previous weakly supervised methods by 11.6–15.3 mAP and achieves performance comparable to fully supervised ones.
ISSN: 2199-4536, 2198-6053
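
As a rough, hypothetical sketch of the adapter idea described in the abstract (this is not the authors' ReferringAdapter implementation; the module names, bottleneck width, and stand-in backbone layer below are assumptions), a bottleneck adapter plugged into a frozen pretrained layer could look like this:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a small residual MLP that is the only trainable part."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project features to a narrow bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # project back to the layer width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual form: frozen backbone features pass through unchanged,
        # plus a small learned correction from the adapter.
        return x + self.up(self.act(self.down(x)))

class FrozenLayerWithAdapter(nn.Module):
    """Wrap a pretrained layer, freeze it, and append a trainable adapter."""
    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad = False  # backbone weights stay fixed
        self.adapter = Adapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))

# Toy usage with a stand-in backbone layer (hypothetical dimensions).
wrapped = FrozenLayerWithAdapter(nn.Linear(256, 256), dim=256)
out = wrapped(torch.randn(4, 256))
trainable = sum(p.numel() for p in wrapped.parameters() if p.requires_grad)
total = sum(p.numel() for p in wrapped.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```

In the full method the adapter would also have to inject language features so the frozen image model becomes cross-modal; the sketch only illustrates the parameter-efficiency mechanism, in which gradient updates reach the adapter alone (1.1% of total parameters in the paper) while the pretrained segmentation backbone stays frozen.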