Coarse-Fine Tracker: A Robust MOT Framework for Satellite Videos via Tracking Any Point
Traditional Multiple Object Tracking (MOT) methods in satellite videos mostly follow the Detection-Based Tracking (DBT) framework. However, the DBT framework assumes that all objects are correctly recognized and localized by the detector. In practice, the low resolution of satellite videos, small ob...
| Main Authors: | Hanru Shi, Xiaoxuan Liu, Xiyu Qi, Enze Zhu, Jie Jia, Lei Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Remote Sensing |
| Subjects: | multiple object tracking; detection-based tracking; tracking any point |
| Online Access: | https://www.mdpi.com/2072-4292/17/13/2167 |
| _version_ | 1849428743869693952 |
|---|---|
| author | Hanru Shi; Xiaoxuan Liu; Xiyu Qi; Enze Zhu; Jie Jia; Lei Wang |
| author_facet | Hanru Shi; Xiaoxuan Liu; Xiyu Qi; Enze Zhu; Jie Jia; Lei Wang |
| author_sort | Hanru Shi |
| collection | DOAJ |
| description | Traditional Multiple Object Tracking (MOT) methods for satellite videos mostly follow the Detection-Based Tracking (DBT) framework. However, the DBT framework assumes that all objects are correctly recognized and localized by the detector. In practice, the low resolution of satellite videos, small objects, and complex backgrounds inevitably lead to a decline in detector performance. To alleviate the impact of detector degradation on tracking, we propose Coarse-Fine Tracker, a framework that integrates the MOT framework with the Tracking Any Point (TAP) method CoTracker for the first time, leveraging TAP’s persistent point correspondence modeling to compensate for detector failures. In our Coarse-Fine Tracker, we divide the satellite video into sub-videos. For each sub-video, we first use ByteTrack to track the outputs of the detector, referred to as coarse tracking, which involves the Kalman filter and box-level motion features. Given the small size of objects in satellite videos, we treat each object as a point to be tracked. We then use CoTracker to track the center point of each object, referred to as fine tracking, by calculating the appearance feature similarity between each point and its neighboring points. Finally, the Consensus Fusion Strategy eliminates mismatched detections in the coarse tracking results by checking their geometric consistency against the fine tracking results, and recovers missed objects via linear interpolation or linear fitting. The method is validated on the VISO and SAT-MTB datasets. Experimental results on VISO show that the tracker achieves a multi-object tracking accuracy (MOTA) of 66.9, a multi-object tracking precision (MOTP) of 64.1, and an IDF1 score of 77.8, surpassing the detector-only baseline by 11.1% in MOTA while reducing ID switches by 139. Comparative experiments with ByteTrack demonstrate the robustness of our tracking method when the performance of the detector deteriorates. |
| format | Article |
| id | doaj-art-3ffd02b076ee4dac87c0fbd88b7eca48 |
| institution | Kabale University |
| issn | 2072-4292 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Remote Sensing |
| spelling | doaj-art-3ffd02b076ee4dac87c0fbd88b7eca48; 2025-08-20T03:28:37Z; eng; MDPI AG; Remote Sensing; ISSN 2072-4292; 2025-06-01; vol. 17, no. 13, art. 2167; DOI 10.3390/rs17132167; Coarse-Fine Tracker: A Robust MOT Framework for Satellite Videos via Tracking Any Point; Hanru Shi, Xiaoxuan Liu, Xiyu Qi, Enze Zhu, Jie Jia, Lei Wang (all: Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China); https://www.mdpi.com/2072-4292/17/13/2167; multiple object tracking; detection-based tracking; tracking any point |
| title | Coarse-Fine Tracker: A Robust MOT Framework for Satellite Videos via Tracking Any Point |
| topic | multiple object tracking; detection-based tracking; tracking any point |
| url | https://www.mdpi.com/2072-4292/17/13/2167 |
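
The fusion step summarized in the description (drop coarse detections that disagree geometrically with the fine point track, then recover missed frames by linear interpolation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_track`, the per-object dictionary layout, the Euclidean distance test, and the `max_dist` threshold are all hypothetical choices made for illustration, and the record does not state the actual consistency criterion or how linear fitting is applied.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def fuse_track(
    coarse: Dict[int, Point],
    fine: Dict[int, Point],
    max_dist: float = 3.0,  # hypothetical pixel threshold, not from the paper
) -> Dict[int, Point]:
    """Fuse one object's coarse and fine trajectories (illustrative only).

    coarse: frame index -> box centre from the detection-based tracker;
            frames where the detector missed the object are simply absent.
    fine:   frame index -> position from the point tracker.
    """
    # Step 1: keep a coarse detection only if it lies close to the fine
    # point in the same frame (assumed form of "geometric consistency").
    kept: Dict[int, Point] = {}
    for frame, (cx, cy) in coarse.items():
        if frame in fine:
            fx, fy = fine[frame]
            if ((cx - fx) ** 2 + (cy - fy) ** 2) ** 0.5 > max_dist:
                continue  # treated as a mismatched detection
        kept[frame] = (cx, cy)

    # Step 2: fill gaps between consecutive kept frames by linear
    # interpolation of the centre coordinates.
    fused = dict(kept)
    frames = sorted(kept)
    for a, b in zip(frames, frames[1:]):
        (ax, ay), (bx, by) = kept[a], kept[b]
        for f in range(a + 1, b):
            t = (f - a) / (b - a)
            fused[f] = (ax + t * (bx - ax), ay + t * (by - ay))
    return fused
```

For example, calling `fuse_track({0: (0, 0), 1: (50, 50), 2: (2, 2), 4: (4, 4)}, {i: (i, i) for i in range(5)})` discards the outlier at frame 1 and fills frames 1 and 3 from their neighbours. A real system would also need to handle gaps at the start or end of a sub-video, which is presumably where the linear fitting mentioned in the description comes in.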