Visual tracking by matching points using diffusion model
Existing Siamese and Transformer-based trackers commonly approach Visual Object Tracking (VOT) as a one-shot detection problem, where the target object is located in a single forward evaluation. While effective, this approach lacks self-correction mechanisms, making these trackers prone to drifting toward visually similar distractors.
| Main Authors: | Mohamad Alansari, Iyyakutti Iyappan Ganapathi, Sara Alansari, Hasan Al Marzouqi, Sajid Javed |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-08-01 |
| Series: | Alexandria Engineering Journal |
| Subjects: | Diffusion models; Segmentation; Segment Anything 2 (SAM2); Visual Object Tracking (VOT) |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1110016825007914 |
| _version_ | 1849229322590617600 |
|---|---|
| author | Mohamad Alansari; Iyyakutti Iyappan Ganapathi; Sara Alansari; Hasan Al Marzouqi; Sajid Javed |
| author_facet | Mohamad Alansari; Iyyakutti Iyappan Ganapathi; Sara Alansari; Hasan Al Marzouqi; Sajid Javed |
| author_sort | Mohamad Alansari |
| collection | DOAJ |
| description | Existing Siamese and Transformer-based trackers commonly approach Visual Object Tracking (VOT) as a one-shot detection problem, where the target object is located in a single forward evaluation. While effective, this approach lacks self-correction mechanisms, making these trackers prone to drifting toward visually similar distractors. To overcome these limitations, we reframe VOT as a spatio-temporal region prediction and segmentation task. In this work, we propose Stable-SAM2, a novel two-stage tracking framework that combines spatio-temporal region prediction with segmentation. In the first stage, we optimize the text embeddings in Stable Diffusion to enforce consistent attention to the target’s spatio-temporal regions by maximizing the cross-attention responses at the query location across frames. These optimized embeddings are used to generate spatio-temporal attention maps that highlight the target object across video frames. In the second stage, the predicted regions are input into the Segment Anything Model 2 (SAM2), which refines them into accurate per-frame segmentation masks. These masks are then converted into bounding boxes to facilitate VOT. We evaluate Stable-SAM2 on six widely recognized and diverse benchmarks, including LaSOT, LaSOText, TrackingNet, TNL2K, OTB99-Lang, and GOT-10k. Extensive experiments demonstrate that Stable-SAM2 delivers performance that is competitive with, and often superior to, supervised state-of-the-art (SOTA) trackers, all without relying on complex VOT-specific training paradigms or large-scale training datasets. The source code of Stable-SAM2 is publicly available at: https://github.com/HamadYA/Stable-SAM2. |
| format | Article |
| id | doaj-art-9e29d21bfa3e4cb3bf3a2c7004f61291 |
| institution | Kabale University |
| issn | 1110-0168 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Alexandria Engineering Journal |
| spelling | doaj-art-9e29d21bfa3e4cb3bf3a2c7004f61291 (indexed 2025-08-22T04:55:35Z); eng; Elsevier; Alexandria Engineering Journal; ISSN 1110-0168; 2025-08-01; Vol. 127, pp. 787–803; DOI 10.1016/j.aej.2025.06.042; Visual tracking by matching points using diffusion model; Mohamad Alansari (corresponding author), Iyyakutti Iyappan Ganapathi, Sara Alansari, Hasan Al Marzouqi, Sajid Javed, all with the Department of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates; http://www.sciencedirect.com/science/article/pii/S1110016825007914 |
| spellingShingle | Mohamad Alansari; Iyyakutti Iyappan Ganapathi; Sara Alansari; Hasan Al Marzouqi; Sajid Javed; Visual tracking by matching points using diffusion model; Alexandria Engineering Journal; Diffusion models; Segmentation; Segment Anything 2 (SAM2); Visual Object Tracking (VOT) |
| title | Visual tracking by matching points using diffusion model |
| title_full | Visual tracking by matching points using diffusion model |
| title_fullStr | Visual tracking by matching points using diffusion model |
| title_full_unstemmed | Visual tracking by matching points using diffusion model |
| title_short | Visual tracking by matching points using diffusion model |
| title_sort | visual tracking by matching points using diffusion model |
| topic | Diffusion models Segmentation Segment Anything 2 (SAM2) Visual Object Tracking (VOT) |
| url | http://www.sciencedirect.com/science/article/pii/S1110016825007914 |
| work_keys_str_mv | AT mohamadalansari visualtrackingbymatchingpointsusingdiffusionmodel AT iyyakuttiiyappanganapathi visualtrackingbymatchingpointsusingdiffusionmodel AT saraalansari visualtrackingbymatchingpointsusingdiffusionmodel AT hasanalmarzouqi visualtrackingbymatchingpointsusingdiffusionmodel AT sajidjaved visualtrackingbymatchingpointsusingdiffusionmodel |
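The abstract's two-stage pipeline ends with a simple geometric step: each per-frame SAM2 mask is converted into a bounding box for VOT evaluation. The NumPy sketch below is a minimal, hedged illustration of that mask-to-box conversion, together with a toy stand-in for turning a stage-1 attention map into a point prompt and coarse region. The function names (`mask_to_bbox`, `attention_to_prompt`) and the fixed relative threshold are illustrative assumptions, not the paper's actual implementation, which derives attention maps from Stable Diffusion cross-attention and refines them with SAM2.

```python
import numpy as np

def mask_to_bbox(mask):
    """Convert a binary segmentation mask to an (x, y, w, h) bounding box."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # no target pixels detected in this frame
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return (int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1))

def attention_to_prompt(attn_map, rel_threshold=0.5):
    """Take the peak of an attention map as a point prompt and threshold
    the map into a coarse region (a stand-in for the stage-1 output that
    would be handed to SAM2 for refinement)."""
    peak = np.unravel_index(np.argmax(attn_map), attn_map.shape)  # (row, col)
    region = attn_map >= rel_threshold * attn_map.max()
    return peak, region

# Toy frame: attention concentrated on a 10x20 block of a 64x64 map.
attn = np.zeros((64, 64))
attn[20:30, 15:35] = 1.0

peak, region = attention_to_prompt(attn)
box = mask_to_bbox(region)  # (15, 20, 20, 10) in (x, y, w, h)
```

In this sketch the box is reported as (x, y, w, h) with x horizontal and y vertical, the convention used by GOT-10k-style ground truth; in a full tracker the `region` would instead be passed to SAM2 as a prompt and `mask_to_bbox` applied to SAM2's refined mask.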