Visual tracking by matching points using diffusion model


Bibliographic Details
Main Authors: Mohamad Alansari, Iyyakutti Iyappan Ganapathi, Sara Alansari, Hasan Al Marzouqi, Sajid Javed
Format: Article
Language: English
Published: Elsevier, 2025-08-01
Series: Alexandria Engineering Journal
Subjects: Diffusion models; Segmentation; Segment Anything 2 (SAM2); Visual Object Tracking (VOT)
Online Access: http://www.sciencedirect.com/science/article/pii/S1110016825007914
Collection: DOAJ
Description: Existing Siamese and Transformer-based trackers commonly approach Visual Object Tracking (VOT) as a one-shot detection problem, where the target object is located in a single forward evaluation. While effective, this approach lacks self-correction mechanisms, making these trackers prone to drifting toward visually similar distractors. To overcome these limitations, we reframe VOT as a spatio-temporal region prediction and segmentation task. In this work, we propose Stable-SAM2, a novel two-stage tracking framework that combines spatio-temporal region prediction with segmentation. In the first stage, we optimize the text embeddings in Stable Diffusion to enforce consistent attention to the target's spatio-temporal regions by maximizing the cross-attention responses at the query location across frames. These optimized embeddings are used to generate spatio-temporal attention maps that highlight the target object across video frames. In the second stage, the predicted regions are input into the Segment Anything Model 2 (SAM2), which refines them into accurate per-frame segmentation masks. These masks are then converted into bounding boxes to facilitate VOT. We evaluate Stable-SAM2 on six widely recognized and diverse benchmarks: LaSOT, LaSOText, TrackingNet, TNL2K, OTB99-Lang, and GOT-10k. Extensive experiments demonstrate that Stable-SAM2 delivers performance superior or competitive relative to supervised state-of-the-art (SOTA) trackers, all without relying on complex VOT-specific training paradigms or large-scale training datasets. The source code of Stable-SAM2 is publicly available at: https://github.com/HamadYA/Stable-SAM2.
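The final step of the second stage (converting each per-frame segmentation mask into a bounding box) can be illustrated with a minimal sketch. This is an illustrative implementation under simple assumptions (a binary NumPy mask, axis-aligned boxes), not code from the Stable-SAM2 repository; the helper name `mask_to_bbox` is hypothetical:

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Convert a binary segmentation mask of shape (H, W) into an
    axis-aligned bounding box (x_min, y_min, x_max, y_max).
    Returns None when the mask contains no foreground pixels."""
    ys, xs = np.nonzero(mask)  # row/column indices of foreground pixels
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

# Toy example: a 6x6 mask with a 2x3 foreground region.
mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:4, 1:4] = 1
print(mask_to_bbox(mask))  # (1, 2, 3, 3)
```

In a tracker, this conversion would run once per frame on the mask produced by SAM2, yielding the box format expected by standard VOT benchmarks.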
Record ID: doaj-art-9e29d21bfa3e4cb3bf3a2c7004f61291
Institution: Kabale University
ISSN: 1110-0168
DOI: 10.1016/j.aej.2025.06.042
Citation: Alexandria Engineering Journal, vol. 127, pp. 787-803, 2025-08-01
Affiliation: Department of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates (Mohamad Alansari, corresponding author)