Enhanced YOLOv12 Through Sliced Contrastive Supervision and Full Scene Fine-Tuning
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Online Access: | https://ieeexplore.ieee.org/document/11113267/ |
| Summary: | Real-time detection of objects in drone-based imagery has proven to be a challenging task for even the most state-of-the-art deep learning models. Due to computational limitations, images are often scaled down during training, reducing the feature space and leading to decreased overall accuracy during validation and inference. This work proposes a two-stage training strategy and several key improvements to one of the most recent You Only Look Once (YOLO) models, YOLOv12. An extra P2 branch and its corresponding scale were added to the head of the network to improve detection of small-scale objects. An additional CIoU-like penalty term was combined with the standard CIoU-based loss used in YOLO to improve detection accuracy. Finally, a contrastive loss function and an associated embedding branch were introduced to help discriminate between features in the embedding space, pulling instances of the same class closer together and pushing instances of different classes further apart. The first stage of training leverages these improvements on a sliced, full-resolution version of the VisDrone2019-DET dataset, which maintains 15% overlap between slices, and the second stage continues to leverage these improvements for fine-tuning on the full images in a scaled-down configuration to provide full scene context. Results demonstrate a 35.5% Mean Average Precision ($mAP_{50:95}$) and a 56.6% $mAP_{50}$ on the validation split. On the test split, results demonstrate a 36.2% $mAP_{50:95}$ and a 57.7% $mAP_{50}$. |
| ISSN: | 2169-3536 |
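The first training stage operates on overlapping slices of the full-resolution VisDrone2019-DET images. Below is a minimal sketch of how such tiles might be computed; the record specifies only the 15% overlap between slices, so the 640 px tile size and the `slice_image` helper are assumptions for illustration, not the paper's code.

```python
import itertools

def slice_image(width: int, height: int,
                tile_size: int = 640, overlap: float = 0.15):
    """Return (x1, y1, x2, y2) tile boxes covering a full-resolution image.

    Illustrative only: the 15% overlap comes from the record, while the
    640 px tile size and this helper name are assumptions.
    """
    stride = int(tile_size * (1.0 - overlap))   # 640 px tiles -> 544 px step
    xs = list(range(0, max(width - tile_size, 0) + 1, stride))
    ys = list(range(0, max(height - tile_size, 0) + 1, stride))
    # Add a final tile flush with the right/bottom edge if uncovered.
    if xs[-1] + tile_size < width:
        xs.append(width - tile_size)
    if ys[-1] + tile_size < height:
        ys.append(height - tile_size)
    return [(x, y, x + tile_size, y + tile_size)
            for y, x in itertools.product(ys, xs)]

# Example: a 1920x1080 frame is covered by 640x640 tiles whose
# neighbours share 15% of their width and height.
for box in slice_image(1920, 1080):
    print(box)
```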
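The contrastive supervision described in the summary pulls same-class embeddings together and pushes different-class embeddings apart. The following PyTorch sketch assumes a standard supervised-contrastive (SupCon-style) formulation as a stand-in; the paper's exact loss, embedding branch architecture, and temperature are not given in this record.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """SupCon-style loss over instance embeddings (hypothetical stand-in
    for the paper's contrastive loss; the 0.1 temperature is assumed)."""
    z = F.normalize(embeddings, dim=1)            # unit-length embeddings
    sim = z @ z.T / temperature                   # scaled cosine similarity
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    # Log-softmax over every other instance for each anchor.
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-probability of each anchor's positives; anchors with
    # no same-class partner in the batch are skipped.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return (-pos_log_prob[valid] / pos_counts[valid]).mean()

# Example: eight 128-d embeddings from the embedding branch, one per
# detected instance, labelled with their object classes.
z = torch.randn(8, 128)
y = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supervised_contrastive_loss(z, y))
```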