Visual tracking by matching points using diffusion model


Bibliographic Details
Main Authors: Mohamad Alansari, Iyyakutti Iyappan Ganapathi, Sara Alansari, Hasan Al Marzouqi, Sajid Javed
Format: Article
Language: English
Published: Elsevier, 2025-08-01
Series: Alexandria Engineering Journal
Subjects: Diffusion models; Segmentation; Segment Anything 2 (SAM2); Visual Object Tracking (VOT)
Online Access: http://www.sciencedirect.com/science/article/pii/S1110016825007914
Collection: DOAJ
Description: Existing Siamese and Transformer-based trackers commonly approach Visual Object Tracking (VOT) as a one-shot detection problem, where the target object is located in a single forward evaluation. While effective, this approach lacks self-correction mechanisms, making these trackers prone to drifting toward visually similar distractors. To overcome these limitations, we reframe VOT as a spatio-temporal region prediction and segmentation task. In this work, we propose Stable-SAM2, a novel two-stage tracking framework that combines spatio-temporal region prediction with segmentation. In the first stage, we optimize the text embeddings in Stable Diffusion to enforce consistent attention to the target's spatio-temporal regions by maximizing the cross-attention responses at the query location across frames. These optimized embeddings are used to generate spatio-temporal attention maps that highlight the target object across video frames. In the second stage, the predicted regions are input into the Segment Anything Model 2 (SAM2), which refines them into accurate per-frame segmentation masks. These masks are then converted into bounding boxes to facilitate VOT. We evaluate Stable-SAM2 on six widely recognized and diverse benchmarks: LaSOT, LaSOText, TrackingNet, TNL2K, OTB99-Lang, and GOT-10k. Extensive experiments demonstrate that Stable-SAM2 delivers performance superior or competitive relative to supervised state-of-the-art (SOTA) trackers, all without relying on complex VOT-specific training paradigms or large-scale training datasets. The source code of Stable-SAM2 is publicly available at: https://github.com/HamadYA/Stable-SAM2.
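The final step of the second stage (converting each per-frame segmentation mask into a bounding box) can be illustrated with a minimal sketch. This is an illustrative implementation under simple assumptions (a binary NumPy mask, axis-aligned boxes), not code from the Stable-SAM2 repository; the helper name `mask_to_bbox` is hypothetical:

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Convert a binary segmentation mask of shape (H, W) into an
    axis-aligned bounding box (x_min, y_min, x_max, y_max).
    Returns None when the mask contains no foreground pixels."""
    ys, xs = np.nonzero(mask)  # row/column indices of foreground pixels
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

# Toy example: a 6x6 mask with a 2x3 foreground region.
mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:4, 1:4] = 1
print(mask_to_bbox(mask))  # (1, 2, 3, 3)
```

In a tracker, this conversion would run once per frame on the mask produced by SAM2, yielding the box format expected by standard VOT benchmarks.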
Record ID: doaj-art-9e29d21bfa3e4cb3bf3a2c7004f61291
Institution: Kabale University
ISSN: 1110-0168
DOI: 10.1016/j.aej.2025.06.042
Citation: Alexandria Engineering Journal, vol. 127, pp. 787-803, 2025-08-01
Affiliation: Department of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates (Mohamad Alansari, corresponding author)