Temporal Pyramid Alignment and Adaptive Fusion of Event Stream and Image Frame for Keypoint Detection and Tracking in Autonomous Driving

Bibliographic Details
Main Authors: Peijun Shi, Chee-Onn Chow, Wei Ru Wong
Format: Article
Language: English
Published: Elsevier 2025-08-01
Series: Alexandria Engineering Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S1110016825005940
Description
Summary: This paper proposes a method that addresses the alignment and fusion challenges in multimodal fusion of event and RGB cameras. For multimodal alignment, we adopt a Temporal Pyramid Alignment mechanism to achieve multi-scale temporal synchronization of event streams and RGB frames. For multimodal fusion, we design an adaptive fusion module that dynamically adjusts the contribution of each modality based on scene complexity and feature quality: a gating network computes fusion weights from both the relative importance of each modality and its noise characteristics. A Cross-Modal Feature Compensation module is integrated into the framework to enhance information utilization. Additionally, the framework incorporates a Dynamic Inference Path Selection mechanism, guided by input complexity, to optimize computational resource allocation, along with a dynamic noise suppression mechanism that improves the robustness of feature extraction. Experimental results on the DSEC dataset demonstrate that the proposed method achieves 36.9% mAP and a 40.1% tracking success rate (SR), surpassing existing approaches by 1.8% mAP and 1.6% SR while maintaining real-time efficiency at 13.1 FPS; it is particularly effective in extreme-lighting and fast-motion scenarios. This work provides a practical solution for autonomous driving, robotics, and augmented reality applications, where robust multimodal perception under dynamic conditions is critical.
ISSN: 1110-0168
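
Note: the abstract describes the adaptive fusion only at a high level, so the paper's exact implementation is not reproduced here. Below is a minimal PyTorch sketch of a gated fusion of this kind, assuming equally shaped event and RGB feature maps; the module name GatedFusion, the hidden width, and the global-average-pool descriptor are illustrative assumptions, not the authors' design.

    # Minimal sketch of scene-adaptive gated fusion (illustrative, not the
    # paper's implementation). A small gating network pools both modalities,
    # predicts one weight per modality, and blends the feature maps.
    import torch
    import torch.nn as nn

    class GatedFusion(nn.Module):  # hypothetical name, not from the paper
        def __init__(self, channels: int, hidden: int = 64):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Linear(2 * channels, hidden),
                nn.ReLU(inplace=True),
                nn.Linear(hidden, 2),  # one logit per modality
            )

        def forward(self, f_event: torch.Tensor, f_rgb: torch.Tensor) -> torch.Tensor:
            # Global-average-pool each (B, C, H, W) map to a (B, C) descriptor,
            # a crude stand-in for the paper's "relative modality importance
            # and noise characteristics".
            d = torch.cat([f_event.mean(dim=(2, 3)), f_rgb.mean(dim=(2, 3))], dim=1)
            w = torch.softmax(self.gate(d), dim=1)  # (B, 2), weights sum to 1
            w_event = w[:, 0].view(-1, 1, 1, 1)
            w_rgb = w[:, 1].view(-1, 1, 1, 1)
            return w_event * f_event + w_rgb * f_rgb

    # Usage: fuse two same-shaped feature maps from the event and RGB branches.
    fusion = GatedFusion(channels=128)
    fused = fusion(torch.randn(2, 128, 40, 60), torch.randn(2, 128, 40, 60))
    print(fused.shape)  # torch.Size([2, 128, 40, 60])

The softmax constrains the two weights to sum to one, so when one modality degrades (e.g., RGB under extreme lighting) the network can shift weight toward the other, which is the behavior the abstract attributes to its gating network.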